Introduction to Data Warehousing - unibz

Introduction to Data Warehousing - unibz

Data Warehouse Models and OLAP Operations Enrico Franconi CS 636 Data Warehouse Architecture CS 336 2 Decision Support Information technology to help the knowledge worker (executive, manager, analyst) make faster & better decisions What were the sales volumes by region and product category for the last year? How did the share price of comp. manufacturers correlate with quarterly profits over the past 10 years? Which orders should we fill to maximize revenues? On-line analytical processing (OLAP) is an element of decision support systems (DSS) CS 336 3 Three-Tier Decision Support Systems Warehouse database server Almost always a relational DBMS, rarely flat files OLAP servers Relational OLAP (ROLAP): extended relational DBMS that maps operations on multidimensional data to standard relational operators

Multidimensional OLAP (MOLAP): special-purpose server that directly implements multidimensional data and operations Clients Query and reporting tools Analysis tools Data mining tools CS 336 4 The Complete Decision Support System Information Sources Data Warehouse Server (Tier 1) OLAP Servers (Tier 2) Clients (Tier 3) e.g., MOLAP Semistructured Sources Data Warehouse extract transform

load refresh etc. Analysis serve Query/Reporting serve e.g., ROLAP Operational DBs serve Data Mining Data Marts CS 336 5 Data Warehouse vs. Data Marts Enterprise warehouse: collects all information about subjects (customers,products,sales,assets, personnel) that span the entire organization Requires extensive business modeling (may take years to design and build) Data Marts: Departmental subsets that focus on selected subjects Marketing data mart: customer, product, sales Faster roll out, but complex integration in the long run

Virtual warehouse: views over operational dbs Materialize sel. summary views for efficient query processing Easy to build but require excess capability on operat. db servers CS 336 6 Approaches to OLAP Servers Relational DBMS as Warehouse Servers Two possibilities for OLAP servers (1) Relational OLAP (ROLAP) Relational and specialized relational DBMS to store and manage warehouse data OLAP middleware to support missing pieces (2) Multidimensional OLAP (MOLAP) Array-based storage structures Direct access to array data structures CS 336 7 OLAP Server: Query Engine Requirements Aggregates (maintenance and querying) Decide what to precompute and when Query language to support multidimensional operations Standard SQL falls short Scalable query processing Data intensive and data selective queries

CS 336 8 OLAP for Decision Support OLAP = Online Analytical Processing Support (almost) ad-hoc querying for business analyst Think in terms of spreadsheets View sales data by geography, time, or product Extend spreadsheet analysis model to work with warehouse data Large data sets Semantically enriched to understand business terms Combine interactive queries with reporting functions Multidimensional view of data is the foundation of OLAP Data model, operations, etc. CS 336 9 Warehouse Models & Operators Data Models relations stars & snowflakes cubes Operators

CS 336 slice & dice roll-up, drill down pivoting other 10 Multi-Dimensional Data Measures - numerical data being tracked Dimensions - business parameters that define a transaction Example: Analyst may want to view sales data (measure) by geography, by time, and by product (dimensions) Dimensional modeling is a technique for structuring data around the business concepts ER models describe entities and relationships Dimensional models describe measures and dimensions CS 336 11 The Multi-Dimensional Model Sales by product line over the past six months Sales by store between 1990 and 1995 Store Info Key columns joining fact table Numerical Measures to dimension tables Prod Code Time Code Store Code Sales

Fact table for measures Product Info Dimension tables Qty Time Info ... CS 336 12 Dimensional Modeling Dimensions are organized into hierarchies E.g., Time dimension: days weeks quarters E.g., Product dimension: product product line brand Dimensions have attributes CS 336 13 Dimension Hierarchies Store Dimension Product Dimension Total

Region District Stores CS 336 Total Manufacturer Brand Products 14 ROLAP: Dimensional Modeling Using Relational DBMS Special schema design: star, snowflake Special indexes: bitmap, multi-table join Special tuning: maximize query throughput Proven technology (relational model, DBMS), tend to outperform specialized MDDB especially on large data sets Products IBM DB2, Oracle, Sybase IQ, RedBrick, Informix

CS 336 15 MOLAP: Dimensional Modeling Using the Multi Dimensional Model MDDB: a special-purpose data model Facts stored in multi-dimensional arrays Dimensions used to index array Sometimes on top of relational DB Products Pilot, Arbor Essbase, Gentia CS 336 16 Star Schema (in RDBMS) CS 336 17 Star Schema Example CS 336 18

Star Schema with Sample Data CS 336 19 The Classic Star Schema A single fact table, with detail and summary data Fact table primary key has only one key column per dimension Each key is generated Each dimension is a single table, highly denormalized Store Dimension STORE KEY Store Description City State District ID District Desc. Region_ID Region Desc. Regional Mgr. Level Fact Table STORE KEY PRODUCTKEY

PERIOD KEY Dollars Units Price Product Dimension PRODUCTKEY Product Desc. Brand Color Size Manufacturer Level Time Dimension PERIOD KEY Period Desc Year Quarter Month Day Current Flag Resolution Sequence Benefits: Easy to understand, easy to define hierarchies, reduces # of physical joins, low maintenance, very simple metadata Drawbacks: Summary data in the fact table yields poorer performance for summary levels, huge dimension tables a problem CS 336 20 The Classic Star Schema

Store Dimension STORE KEY Store Description City State District ID District Desc. Region_ID Region Desc. Regional Mgr. Level Fact Table STORE KEY PRODUCTKEY PERIOD KEY Dollars Units Price Product Dimension PRODUCTKEY Product Desc. Brand Color Size Manufacturer Level Time Dimension PERIOD KEY Period Desc Year Quarter

Month Day Current Flag Resolution Sequence The biggest drawback: dimension tables must carry a level indicator for every record and every query must use it. In the example below, without the level constraint, keys for all stores in the NORTH region, including aggregates for region and district will be pulled from the fact table, resulting in error. Example: Select A.STORE_KEY, A.PERIOD_KEY, A.dollars from Fact_Table A where A.STORE_KEY in (select STORE_KEY from Store_Dimension B where region = North and Level = 2) and CS 336 etc... Level is needed whenever aggregates are stored with detail facts. 21 The Level Problem Level is a problem because because it causes potential for error. If the query builder,

human or program, forgets about it, perfectly reasonable looking WRONG answers can occur. One alternative: the FACT CONSTELLATION model... CS 336 22 The Fact Constellation Schema Store Dimension STORE KEY Store Description City State District ID District Desc. Region_ID Region Desc. Regional Mgr. Fact Table STORE KEY PRODUCTKEY PERIOD KEY Dollars Units Price Product Dimension PRODUCTKEY Product Desc. Brand

Color Size Manufacturer CS 336 Time Dimension PERIOD KEY Period Desc Year Quarter Month Day Current Flag Sequence District Fact Table District_ID PRODUCT_KE Y PERIOD_KEY Dollars Units Price Region Fact Table Region_ID PRODUCT_KEY PERIOD_KEY Dollars Units Price 23

The Fact Constellation Schema Store Dimension STORE KEY Store Description City State District ID District Desc. Region_ID Region Desc. Regional Mgr. Fact Table STORE KEY PRODUCTKEY PERIOD KEY Dollars Units Price Product Dimension PRODUCTKEY Product Desc. Brand Color Size Manufacturer Time Dimension PERIOD KEY Period Desc Year Quarter

Month Day Current Flag Sequence District Fact Table District_ID PRODUCT_KEY PERIOD_KEY Dollars Units Price Region Fact Table Region_ID PRODUCT_KEY PERIOD_KEY Dollars Units Price In the Fact Constellations, aggregate tables are created separately from the detail, therefor it is impossible to pick up, for example, Store detail when querying the District Fact Table. Major Advantage: No need for the Level indicator in the dimension tables, since no aggregated data is stored with lower-level detail Disadvantage: Dimension tables are still very large in some cases, which can slow performance; front-end must be able to detect existence of aggregate facts, which

requires more extensive metadata CS 336 24 Another Alternative to Level Fact Constellation is a good alternative to the Star, but when dimensions have very high cardinality, the sub-selects in the dimension tables can be a source of delay. An alternative is to normalize the dimension tables by attribute level, with each smaller dimension table pointing to an appropriate aggregated fact table, the Snowflake Schema ... CS 336 25 The Snowflake Schema Store Dimension STORE KEY Store Description City State District ID District Desc. Region_ID Region Desc. Regional Mgr. District_ID

Region_ID District Desc. Region_ID Region Desc. Regional Mgr. Store Fact Table STORE KEY PRODUCT KEY PERIOD KEY Dollars Units Price CS 336 District Fact Table District_ID PRODUCT_KEY PERIOD_KEY Dollars Units Price RegionFact Table Region_ID PRODUCT_KEY PERIOD_KEY Dollars Units Price

26 The Snowflake Schema Store Dimension STORE KEY District_ ID Region_ ID Store Description City State District ID District Desc. Region_ ID Region Desc. Regional Mgr. District Desc. Region_ ID Region Desc. Regional Mgr. Store Fact Table STORE KEY PRODUCT KEY PERIOD KEY Dollars Units Price

District Fact Table District_ID PRODUCT_KEY PERIOD_KEY Dollars Units Price RegionFact Table Region_ID PRODUCT_KEY PERIOD_KEY Dollars Unit s Price No LEVEL in dimension tables Dimension tables are normalized by decomposing at the attribute level Each dimension table has one key for each level of the dimensions hierarchy The lowest level key joins the dimension table to both the fact table and the lower level attribute table How does it work? The best way is for the query to be built by understanding which summary levels exist, and finding the proper snowflaked attribute tables, constraining there for keys, then selecting from the fact table. CS 336 27 The Snowflake Schema

Store Dimension STORE KEY District_ ID Region_ ID Store Description City State District ID District Desc. Region_ ID Region Desc. Regional Mgr. District Desc. Region_ ID Region Desc. Regional Mgr. Store Fact Table STORE KEY PRODUCT KEY PERIOD KEY Dollars Units Price District Fact Table District_ID PRODUCT_KEY PERIOD_KEY

Dollars Units Price RegionFact Table Region_ID PRODUCT_KEY PERIOD_KEY Dollars Unit s Price Additional features: The original Store Dimension table, completely de-normalized, is kept intact, since certain queries can benefit by its all-encompassing content. In practice, start with a Star Schema and create the snowflakes with queries. This eliminates the need to create separate extracts for each table, and referential integrity is inherited from the dimension table. Advantage: Best performance when queries involve aggregation Disadvantage: Complicated maintenance and metadata, explosion in the number of tables in the database CS 336 28 Advantages of ROLAP Dimensional Modeling Define complex, multi-dimensional data with simple model Reduces the number of joins a query has to

process Allows the data warehouse to evolve with rel. low maintenance HOWEVER! Star schema and relational DBMS are not the magic solution Query optimization is still problematic CS 336 29 Aggregates Add up amounts for day 1 In SQL: SELECT sum(amt) FROM SALE WHERE date = 1 sale CS 336 prodId p1 p2 p1 p2 p1 p1 storeId s1 s1 s3 s2 s1 s2

date 1 1 1 1 2 2 amt 12 11 50 8 44 4 81 30 Aggregates Add up amounts by day In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date sale CS 336 prodId p1 p2 p1 p2 p1

p1 storeId s1 s1 s3 s2 s1 s2 date 1 1 1 1 2 2 amt 12 11 50 8 44 4 ans date 1 2 sum 81

48 31 Another Example Add up amounts by day, product In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date, prodId sale prodId p1 p2 p1 p2 p1 p1 storeId s1 s1 s3 s2 s1 s2 date 1 1 1 1 2 2

amt 12 11 50 8 44 4 sale prodId p1 p2 p1 date 1 1 2 amt 62 19 48 rollup drill-down CS 336 32 Aggregates Operators: sum, count, max, min, median, ave

Having clause Using dimension hierarchy average by region (within store) maximum by month (within date) CS 336 33 ROLAP vs. MOLAP ROLAP: Relational On-Line Analytical Processing MOLAP: Multi-Dimensional On-Line Analytical Processing CS 336 34 The MOLAP Cube Fact table view: sale prodId p1 p2 p1 p2 storeId s1 s1 s3

s2 Multi-dimensional cube: amt 12 11 50 8 p1 p2 s1 12 11 s2 s3 50 8 dimensions = 2 CS 336 35 3-D Cube Fact table view: sale prodId

p1 p2 p1 p2 p1 p1 storeId s1 s1 s3 s2 s1 s2 Multi-dimensional cube: date 1 1 1 1 2 2 amt 12 11 50 8 44 4 day 2 day 1

p1 p2 s1 p1 12 p2 11 s1 44 s2 4 s2 s3 s3 50 8 dimensions = 3 CS 336 36 Example roll-up to region Product or t

S NY e SF LA Juice Milk Coke Cream Soap Bread 10 34 56 32 12 56 M T W Th F S S Dimensions: Time, Product, Store roll-up to brand Attributes: Product (upc, price, ) Store Hierarchies: Product Brand Day Week Quarter

roll-up to week Store Region Country Time 56 units of bread sold in LA on M CS 336 37 Cube Aggregation: Roll-up day 2 day 1 p1 p2 p1 p2 p1 p2 s1 44 s2 4 s1 12 11 s1 56

11 s2 s3 50 8 s2 4 8 rollup drill-down CS 336 s3 Example: computing sums ... s3 50 sum s1 67 s2 12 s3

50 129 p1 p2 sum 110 19 38 Cube Operators for Roll-up day 2 day 1 p1 p2 p1 p2 p1 p2 s1 44 s1 12 11 s1 56 11 s2

4 s2 ... s3 50 sale(s1,*,*) 8 s2 4 8 sale(s2,p2,*) CS 336 s3 s3 50 sum s1 67 s2 12 s3 50

129 p1 p2 sum 110 19 sale(*,*,*) 39 Extended Cube * day 2 day 1 CS 336 p1 p2 * p1 p2 s1 * 12 11 23 p1

p2 * s1 44 s2 44 8 8 s1 56 11 67 s2 4 s3 4 50 50 s2 4 8 12 s3 * 62 19 81 s3 50 50 *

48 48 * 110 19 129 sale(*,p2,*) 40 Aggregation Using Hierarchies day 2 day 1 p1 p2 p1 p2 s1 44 s1 12 11 s2 4 s2 s3 s3

50 8 store region country p1 p2 CS 336 region A 56 11 region B 54 8 (store s1 in Region A; stores s2, s3 in Region B) 41 Slicing day 2 day 1 p1 p2 s1 p1 12

p2 11 s1 44 s2 4 s2 s3 s3 50 8 TIME = day 1 p1 p2 CS 336 s1 12 11 s2 s3 50 8

42 Slicing & Pivoting Products Store s1 Store s2 Electronics Toys Clothing Cosmetics Electronics Toys Clothing Cosmetics Products Store s1 Store s2 CS 336 Electronics Toys Clothing Cosmetics Electronics Toys Clothing Sales

($ millions) Time d1 d2 $5.2 $1.9 $2.3 $1.1 $8.9 $0.75 $4.6 $1.5 Sales ($ millions) d1 Store s1 Store s2 $5.2 $8.9 $1.9 $0.75 $2.3 $4.6 $1.1 $1.5 43 Summary of Operations Aggregation (roll-up) aggregate (summarize) data to the next higher dimension element e.g., total sales by city, year total sales by region, year Navigation to detailed data (drill-down) Selection (slice) defines a subcube e.g., sales where city =Gainesville and date = 1/15/90

Calculation and ranking e.g., top 3% of cities by average income Visualization operations (e.g., Pivot) Time functions e.g., time average CS 336 44 Query & Analysis Tools CS 336 Query Building Report Writers (comparisons, growth, graphs,) Spreadsheet Systems Web Interfaces Data Mining 45

Recently Viewed Presentations

  • Clickers Technology in the Classroom

    Clickers Technology in the Classroom

    Clickers in the classroom Implementing Peer Instruction in Cegep Nathaniel Lasry, PhD Physics Dept, John Abbott College Div of Engineering & Appl. Sc, Harvard
  • Regional Coordination on Sand & Sediment Resources Panelists

    Regional Coordination on Sand & Sediment Resources Panelists

    Sediment Budget and RSM. Limited quantity and quality - within currently authorized borrow areas. New Sources - Further away. Sediment Budget Management within Project. Does the project need to be re-evaluated. Additional coordination with Resource agencies.
  • AP Psych OBJECTIVE Understand the main concepts, as

    AP Psych OBJECTIVE Understand the main concepts, as

    Does intelligence stay stable over a lifetime or change? Are individuals on the two extremes of the intelligence scale really different? If a 6 month old seems to developing more slowly and is not as playful as other infants her...
  • Access the Defense Equal Opportunity Management Institute at

    Access the Defense Equal Opportunity Management Institute at

    Requesting a DEOCS. Go to . www.DEOMI.org. Click on "Organizational Assessment" Click on "DEOMI Organizational Climate Survey (DEOCS)" S:\Joint Groups\EO\Army Program\Training\References
  • Safeguarding and Personalisation - the legal issues

    Safeguarding and Personalisation - the legal issues

    I say genuine, because of the Daynes case - more on this, soon. Where the council was, and still is the purchaser, and it's just the provider who is different; or the client was the purchaser, and now the agency...
  • E E 2315 Circuits I Introduction - University of Texas at ...

    E E 2315 Circuits I Introduction - University of Texas at ...

    E E 2315 Circuits I ... write exams at workplace On-campus students write exams here Homework Submission Get your homework assignment from web Print from your browser to get hardcopy Work out the problems ask for help if needed see...
  • Instant Recess Activities in the Regions (to date)

    Instant Recess Activities in the Regions (to date)

    All week - Wellness champions leading facility-specific events each day to best fit operations and patient care needs. To learn more about activities at your facility, contact your local Healthy Workforce or Total Health champion. All week - Instant Recess...
  • Cranial Nerves - Examnnotes.com

    Cranial Nerves - Examnnotes.com

    Cerebral Cortex Two hemispheres Separated by longitudinal fissure Right and Left hemispheres Connected by Corpus Callosum Outer gray matter - neurones Inner white matter - nerve fibres, and neuroglia Sulci and gyri Frontal parietal temporal and occipital lobes Brain Stem...