I am often asked usually by programmers - What is Data Warehousing & how do I learn it? I explain to them we use all the same tools that you do but differently. That’s when I coined the term Data Sense. It describes the essence of Data Warehousing and separates Data Warehousing from rest of Programming. Every aspect of IT from Hardware / Software infrastructure to Design, Development and QA is done with massive data flows and need for data precession accuracy and meaning.

Sunday, February 18, 2007

Data Warehouse Design Standards

Aparna Chamerthi & Vijay K Nadendla

File Sources: Source pushes file to data Collector. If it pushes somewhere else it is moved by Ab Initio to Data Collector

Data Sources: Tables are copied whole. Larger tables have the incremental data copied if possible.

Data Collector: A database dimensional / small tables & MFS for fact data that is extracted from Data Sources / file Sources. Minimal transformations. One Data Collector for all Data Warehouses. Proceses in to the Data Collector are included in Ab Initio Data Collector Project.

Stage: All data is integrated & transformed only here. A single periods data is usually processed. It will contain structures similar to the data sources, the final dimensional model similar to data warehouse and intermediate tables if necessary. All dimension / look up tables are copied daily from data warehouse and reconciled with the source systems. There is a different stage per DW. Processes in and out of stage db are included in Ab Initio Stage Project.

ODS: An Operational Data Store for quick short term reporting. Not much processing is involved. In Ab Initio ODS project will include all processes coming into ODS db.

Data Warehouse: The data is stored in lowest incoming grain and in dimensional format. It contains the master tables for dimensions and lookups. Extracts and some aggregates are done on data warehouse. Processes out (and out and in) of Data Warehouse are part of DW Ab Initio Project.

Data Mart: Subject / question oriented small data repositories. Processes in and out of Data Mart are part of Data Mart Project.

ETL[AX] = Extract, Transform, Load [Aggregate, eXtract]. This is a single process that runs daily at a certain time (and for a certain time) and in that order. Processes cannot be mixed. e.g. All Extracts of a data warehouse must occur before transforms occur.

Slowly Changing Dimensions: Lastupdatetime is added to all the dimensions. If any dimension record changes a new record is inserted with new values. Old record stays as is.

Conform Dimensions: IDT’s conform dimensions are D_Date, D_Customer, D_Access_Point …

Lookup Tables: DW Group maintains the lookup tables and their values.

Facts: Multiple source data are integrated into single fact table. Usually there is only one fact table. Completely independent data with non overlapping dimensions from different events and with almost no occasion of joins between fact tables can be separated resulting in multi star schema. E.g. sales transactions and CDRs

Aggregates: Aggregates are multidimensional. They are not for a single report.

Summary Tables: Maybe built in users schema.