I am often asked usually by programmers - What is Data Warehousing & how do I learn it? I explain to them we use all the same tools that you do but differently. That’s when I coined the term Data Sense. It describes the essence of Data Warehousing and separates Data Warehousing from rest of Programming. Every aspect of IT from Hardware / Software infrastructure to Design, Development and QA is done with massive data flows and need for data precession accuracy and meaning.

Thursday, August 14, 2008

Facts, measures, aggregations and cubes …….

In my last blog post I quickly skipped from facts to measures to aggregations. Let me make this technique clearer.

A business transaction can be represented by a point in business-dimensional space. The point has properties assigned to it based on the business transaction. In BI a single point is not of much use. You are more looking at trend as an indicator of what has happened and forecast what will happen and plan for it. One method of finding trends out of individual events with properties in multi dimensional space is to model the business with parameters and fit the model to the data to get the parameters. Another alternate way is to define statistical quantities in small units of the dimensions (instead of points). These statistical quantities are measures. They can be as simple as sum or distinct count or much more complex like EBIDA. Going from individual events to statistical quantities not only reduces the size of data but also its complexity – several hundred stock prices’ trend is encapsulated into one DJIA number that signifies the entire market.

Now the units of dimension are not universal or fixed. Infact several nested units can be defined for same dimension. At different scale you may notice different trends in the business as discovered in Data. NASDAQ may be trending down but NASDAQ 100 may be trending up. Or commodities Index may be trending up while DJIA is crashing. However it is non trivial to define and measure statistical quantities from one unit of dimension to another – or in other words move measures from one level in dimension hierarchy to another. Such techniques are called aggregations.

Its also possible to define aggregations when the dimensional space is projected into a space with fewer dimensions – e.g. You take location out of consideration and analyze business performance against other dimensions. Again this is non trivial to define and project measures onto such projections

There are always a large number of projects that can be defined in a multi dimensional space. If you go from 6 dimensional to 4 dimensional space there are 15 different projects possible. All of them may not have business relevance. A business may still have 5-20 relevant dimensional projects – also called aggregate tables of various reductions (In number of dimensions). Still it is too complex to maintain and defeats the purpose of data warehousing as a data deliverable to end users.

Solving both the dimension projection, and Dimensional Scale change problems is the invention of data cubes or multi dimensional databases. Multidimensional databases store data with measures already calculated (or capable of automatically calculating) against all dimensional scales and dimensional projects – ie hierarchies and slicing and dicing. They further provide the capability of drilling through – or go from larger to smaller dimensional scale or even to individual business transactions.
Its every cool if you know what you are doing. If not you can get burnt very fast and not know about it – too many companies are discovering that their MDX programmers are building Data Gyms of unknown accuracy instead of Cubes. That will be subject of future blog