Why do we build data warehouses?
A) To streamline, consolidate and make repeatable information generation and delivery.
B) To facilitate speedy, accurate and repeatable exploratory data analysis by non technical users which still has Business Integrity
C) To streamline, consolidate and make repeatable statistical data analysis efforts
In A) I have control over the information and hence can ensure the Business Integrity. In C) I am working with experts in statistical analysis and am just providing them a platform (data & technical) for their analysis.
The troublesome piece is B). This is where the innovation of Ralph Kimball comes in. He has developed a structure for data that is simple without sacrificing accuracy. This dimensional model assigns a transaction to one and only one point in multi dimensional space. The dimensions themselves are modeled after Cartesian system – orthogonal with each dimension having a hierarchy of scales defined. A point in multi dimensional space can have multiple events. Properties of these events are represented as measures at that point. These measures can be aggregated (simple sum, count or more complex functions) to define derived measures at each scale. The calculation of measures of a scale from events or measures at lower scale are non trivial and needs to be handled carefully.
It is the data warehouse architect’s goal to fit the data into the scheme described above. While extensions and innovations are always welcome they must be weighed against the requirement to ensure Business Integrity of exploratory data analysis without having to hire a statistician to look over their shoulder or in case of Data Gyms the programmer’s shoulder.
I am often asked usually by programmers - What is Data Warehousing & how do I learn it? I explain to them we use all the same tools that you do but differently. That’s when I coined the term Data Sense. It describes the essence of Data Warehousing and separates Data Warehousing from rest of Programming. Every aspect of IT from Hardware / Software infrastructure to Design, Development and QA is done with massive data flows and need for data precession accuracy and meaning.