I am often asked, usually by programmers: what is Data Warehousing and how do I learn it? I explain that we use all the same tools they do, but differently. That's when I coined the term Data Sense. It describes the essence of Data Warehousing and separates it from the rest of programming. Every aspect of IT, from hardware and software infrastructure to design, development and QA, is done against massive data flows and the need for data precision, accuracy and meaning.

Saturday, August 30, 2008

How and why are hundred-dimension data warehouses designed? Is this the birth of Object Dimensional Modeling?

As a follow-up to my previous blogs on the need for, and the process to design, a simple dimensional model that can satisfy the analysis requirements, I have investigated how and why 100-dimension monsters get built. I acknowledge conversations with colleagues on this subject.

Why:
1) Sponsors of the data warehouse want every single attribute from all data sources to be stored in the data warehouse. This is the big bang approach: build once for all future needs. Users also don't want to be told a particular attribute is not present in the data warehouse after it's built, so the safest approach is to store everything.

2) Users are accustomed to the data elements as they worked with them during the business processes. Abstraction into new variables would take away that familiarity, so there is reluctance.

3) Designers were never exposed to variable reduction using abstraction and hence don't consider it as an option. Designers also have an improper understanding of what a dimension is: they slap "dim" in front of any source table name and assume it is a dimension of the data warehouse. Further, they perform horizontal partitioning of data (very OO) for performance reasons.

How:

1) Source databases' data models are copied into one structure. Each source (or a small set of sources) becomes a subject area.

2) Tables that completely (or very nearly) overlap between data sources are merged.

3) The purpose of each table is investigated. If it is a transaction, it becomes a fact table (two or more tables can be merged). If it is a reference / lookup / business process or other non-transaction table, it becomes a dimension. Conformed dimensions are set up between subject areas.

4) Other tables / data are merged into the identified fact / dimension tables. SCD properties are identified and metadata added. Load-process metadata tables / attributes are added.

This process reminds me of the creation of a collage and the end effect is very much the same. A very fragmented, but tightly knit data structure.
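To make the mechanics concrete, here is a minimal sketch in Python of the classification described in the steps above. The table names and the is_transactional flag are invented for illustration; this is the mechanical approach being critiqued, not a recommendation.

```python
# Hedged sketch: mechanically sort hypothetical source tables into facts and
# dimensions the way the steps above describe. All metadata here is invented.
source_tables = [
    {"name": "trade", "is_transactional": True},              # transaction -> fact
    {"name": "security_master", "is_transactional": False},   # reference -> dimension
    {"name": "country_lookup", "is_transactional": False},    # lookup -> dimension
]

facts, dimensions = [], []
for table in source_tables:
    if table["is_transactional"]:
        facts.append("fact_" + table["name"])
    else:
        # "slap dim in front of any source table name"
        dimensions.append("dim_" + table["name"])

print(facts)       # ['fact_trade']
print(dimensions)  # ['dim_security_master', 'dim_country_lookup']
```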

As my colleagues mentioned, this is the bottom-up approach. My interpretation of the bottom-up approach is different, and I'll make it the subject of a future blog.

My Comments:

1) I look at this as a data / database consolidation project, not a data warehousing project.

2) I am not sure there is any advantage to moving to dimensional modeling in such cases. Why not use 3NF where you need to build a complete data repository for the enterprise? I am not sure a 100-dimension data web is more understandable, accurate or faster than a 3NF model. In fact, it is the opposite.

3) Some of these techniques smell similar to ORM. And given that most sources are already in object models, are we looking at the birth of ODM - Object Dimensional Mapping?

4) Data access software and tools have to take the above structures into account. A significant amount of code needs to be written to analyze data in this scenario, which looks very much like stovepipe analysis. Doing stovepipe analysis and maintaining the code in MDX, for example, is non-trivial.

Vijay

Wednesday, August 20, 2008

Microsoft BI and CubeFilesystem

This is my real experience analyzing several Microsoft BI implementations – especially in Financial Institutions.

Programmers are used to treating the filesystem as a place where they store and retrieve data in raw form, with significant logic written in code (C++, Java, etc.) on top of it to get the required functionality. When given a new toy, MDX, they use it exactly the same way: as a filesystem to build logic on, instead of a business-dimensional space to build logic in! What they should be concentrating on is the structure of the dimensional space, its various measures, the definition of scales, the logic embedded in moving measures from scale to scale, and the definition of measures when all you have is data points.

Not that it's wrong. But why do you need to write a program in MDX? You may as well use Java and aggregate data structures landed to disk, if that is how you want to proceed.


Aug 21: Update

The last few blogs have led to some emails from friends asking me: if what they are doing isn't apt use of the tool, what then is the proper use? I referred them to my previous blogs but promised to summarize here.
0) Define the space you are going to work in, i.e. the next steps 1-3.
1) Identify dimensions as entities holding business variables. Ask: does the problem I am looking to measure / model contain this variable as an independent parameter? Not: is this an artifact of data representation? So lookups and data-processing artifacts of whatever kind are not dimensions. The next question is whether all these variables are independent, or whether they can be grouped as abstract entities, thereby decreasing the dimensionality. Keep in mind the world is not 100-dimensional.
2) Ensure your dimensions are orthogonal, i.e. there are no relationships between dimensions. The only relationships allowed are hierarchies within dimensions and relationships between attributes (e.g. code, description).
3) Further abstract your dimensions at the entity level to reduce dimensionality. It is OK if some attributes are empty; you are not trying to reduce space here.
4) Now you have defined an n-dimensional space with multiple scales for the problem. Place the data points in the space.
5) Define measures that represent the observed data at the dimensional hierarchy level of the data points.
6) Define the relationships between measures at higher levels of the hierarchy and at the lower / observed levels. These could be simple additions of values across dimensional cells, or much more complex formulae.
Now you have defined and placed the data properly in MS BI cubes. Remember you still need to verify results against the data points, since error calculation and propagation are not part of the above scheme. You may incorporate them, but it may be non-trivial.
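To make the steps concrete, here is a minimal sketch in Python (not MDX) of a tiny business-dimensional space. The dimensions, hierarchy levels and data points are all invented for illustration.

```python
from collections import defaultdict

# Steps 1-3: two orthogonal dimensions, each with a small hierarchy (invented).
date_hierarchy = {"2008-08-01": "2008-08", "2008-08-02": "2008-08"}   # day -> month
product_hierarchy = {"IBM": "Tech", "MSFT": "Tech", "XOM": "Energy"}  # security -> sector

# Step 4: data points placed in the space at the lowest scale (day, security).
data_points = [
    ("2008-08-01", "IBM", 120.0),
    ("2008-08-01", "MSFT", 80.0),
    ("2008-08-02", "XOM", 95.0),
]

# Step 5: a base measure at the observed level - here a simple sum of value.
base = defaultdict(float)
for day, product, value in data_points:
    base[(day, product)] += value

# Step 6: the relationship between levels - here plain addition up both hierarchies.
rolled_up = defaultdict(float)
for (day, product), value in base.items():
    rolled_up[(date_hierarchy[day], product_hierarchy[product])] += value

print(dict(rolled_up))  # {('2008-08', 'Tech'): 200.0, ('2008-08', 'Energy'): 95.0}
```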
Enjoy

Tuesday, August 19, 2008

Where the mind is free......

Sometimes an intractable problem seems so only because it is misclassified. When you take a step back, look at the big picture and think, it is no longer an issue.
A case in point is a data problem that some hedge fund companies mentioned: how do you define a slowly changing dimension to take care of the issue where there are multiple identification schemes for the same security, and each transaction uses one or more of these schemes? These schemes have non-trivial many-to-many relationships with each other. This is very much a back-office problem.
But for market analysis, does it matter? Would storing it as a junk dimension be OK? Can we run association algorithms on it to group records with different sets of security identifiers together into one record, since at some level they represent the same security? Or, in other words, would the information contained in the grouped security identifiers be enough for analysis? Hmmm, once we think about it, it becomes a simple junk dimension + data mining problem.
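One way to do the grouping hinted at above is to treat records that share any identifier as linked and take connected components. A minimal sketch with invented identifier sets (the CUSIP / ISIN / ticker values are placeholders, not real securities):

```python
# Hedged sketch: group records whose security-identifier sets overlap, by
# treating shared identifiers as edges and taking connected components.
records = [
    {"cusip": "C1", "isin": None, "ticker": "AAA"},
    {"cusip": None, "isin": "I1", "ticker": "AAA"},   # shares ticker with record 0
    {"cusip": "C2", "isin": "I2", "ticker": "BBB"},   # unrelated security
]

parent = list(range(len(records)))

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def union(i, j):
    parent[find(i)] = find(j)

# Link records that share any non-null identifier value.
seen = {}
for idx, rec in enumerate(records):
    for key, value in rec.items():
        if value is None:
            continue
        if (key, value) in seen:
            union(idx, seen[(key, value)])
        else:
            seen[(key, value)] = idx

groups = {}
for idx in range(len(records)):
    groups.setdefault(find(idx), []).append(idx)
print(list(groups.values()))  # [[0, 1], [2]] - records 0 and 1 are the same security
```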
Where the mind is free........

Friday, August 15, 2008

The Hedge Fund P&L Problem

I have talked to a few hedge funds over the last couple of months, and this is a common problem they face when implementing Microsoft BI solutions without a deep understanding of dimensional modeling. They are looking for MDX experts to help them dig out of the hole, not recognizing that it's a design issue.
Here is a simplified version of the problem. A hedge fund owns m(n) quantities of n securities. Every day they can buy or sell: both the quantity of each security held (e.g. 100 shares of IBM) and the set of securities held (IBM, MSFT, HLTH…) may change. The values of the securities themselves may also change (let's simplify by assuming mark-to-market). They get a feed of the daily transactions that were closed, and they need to make a daily (or periodic) profit and loss estimate. This could influence market communications and security transaction decisions. I assume accounting may make estimates of these, but you may be looking to build an SSAS cube in addition, to study P&L by various dimensions to improve your business, or to build an executive dashboard.
Please read the last two blogs to understand the underpinnings of Dimensional modeling. I’ll build on those to explain how we model the above problem. First let me explain how it was modeled wrong in several places I am aware of.
Take the incoming purchases / sales as transactions / facts. Identify dimensions (mistakes were made here, but that's beside the point). Build an SSAS cube out of those transactions. Now start writing MDX programs that deduce P&L from those transactions. It's not that they didn't understand the problem; they understand how to calculate P&L from the data in hand. But they are using cubes as a filesystem and writing MDX programs on top of it. If you want to do that, use Java. Why struggle with MDX? It's not that the business is complex. It's just that the design is wrong: the measures you are storing at the base level of the cube are wrong.
What does your car speedometer show you? The speed of the car at that instant. Not acceleration, not velocity, just the speed. Imagine if it showed acceleration and you constantly had to calculate speed in the back of your mind by adding to (or subtracting from) an initial speed of 0.
So what should the right measures be in the hedge fund P&L business-dimensional space? That depends on the understanding of the problem. Don't confuse sales transactions with business events in the business-dimensional space. What we are trying to analyze in P&L is the holdings, not the sales. Sales are just changes in the holdings, i.e. a derived measure / calculation in the cube. Moving the measures to reflect holdings, not sales, will solve the complexity issues with the P&L cubes. You no longer have to traverse several cells in sales space to deduce P&L; P&L is a simple calculation from the differences in measures at cells A and B. This will not increase the size of the cube, since the size of the cube is dictated by the cardinality of the dimensions. It's just the ETL that needs to reflect the logic of deducing holdings from initial holdings and sales.
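A minimal sketch of that ETL logic in Python, with invented positions, trades and prices (cash flows and trade costs are ignored for simplicity; this illustrates the idea, not anyone's production design):

```python
# Hedged sketch: derive daily holdings from initial holdings plus the trade feed,
# then P&L is just the difference between the marked-to-market values of two cells.
initial_holdings = {"IBM": 100, "MSFT": 200}           # shares held before day 1

trades = {                                             # daily feed of closed trades
    "2008-08-14": [("IBM", +50), ("MSFT", -100)],
    "2008-08-15": [("HLTH", +300)],
}
prices = {                                             # mark-to-market closing prices
    "2008-08-14": {"IBM": 125.0, "MSFT": 27.0, "HLTH": 10.0},
    "2008-08-15": {"IBM": 126.0, "MSFT": 27.5, "HLTH": 9.5},
}

def market_value(holdings, day):
    return sum(qty * prices[day][sec] for sec, qty in holdings.items())

holdings = dict(initial_holdings)
value_by_day = {}
for day in sorted(trades):
    for security, qty in trades[day]:                  # apply the day's trades
        holdings[security] = holdings.get(security, 0) + qty
    value_by_day[day] = market_value(holdings, day)    # the measure stored per cell

# P&L between two cells is a simple difference of the stored holdings measures.
pnl = value_by_day["2008-08-15"] - value_by_day["2008-08-14"]
print(value_by_day, pnl)
```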

And since I am often asked: no, there is no single dimension that can fix the above problem. A particular facet may be patched, but a hack is a hack, and the problem shows up elsewhere, in the need for complex MDX or in unknown data accuracy.

Thursday, August 14, 2008

Facts, measures, aggregations and cubes …….

In my last blog post I quickly skipped from facts to measures to aggregations. Let me make this technique clearer.

A business transaction can be represented by a point in business-dimensional space. The point has properties assigned to it based on the business transaction. In BI a single point is not of much use; you are more interested in the trend as an indicator of what has happened, so you can forecast what will happen and plan for it. One method of finding trends in individual events with properties in multi-dimensional space is to model the business with parameters and fit the model to the data to obtain those parameters. An alternative is to define statistical quantities over small units of the dimensions (instead of points). These statistical quantities are measures. They can be as simple as a sum or a distinct count, or much more complex, like EBITDA. Going from individual events to statistical quantities not only reduces the size of the data but also its complexity: the trend of several hundred stock prices is encapsulated into one DJIA number that signifies the entire market.

Now, the units of a dimension are not universal or fixed. In fact several nested units can be defined for the same dimension. At different scales you may notice different trends in the business as discovered in the data: the NASDAQ may be trending down while the NASDAQ 100 is trending up, or a commodities index may be trending up while the DJIA is crashing. However, it is non-trivial to define and compute statistical quantities from one unit of a dimension to another, or in other words to move measures from one level in a dimension hierarchy to another. Such techniques are called aggregations.
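As a small illustration of why moving measures between scales is non-trivial, here is a sketch with invented numbers: a sum rolls up from day to month by simple addition, but a distinct count does not.

```python
from collections import defaultdict

# Invented event-level data: (day, month, customer, amount).
events = [
    ("2008-08-01", "2008-08", "custA", 10.0),
    ("2008-08-01", "2008-08", "custB", 20.0),
    ("2008-08-02", "2008-08", "custA", 15.0),
]

daily_sum = defaultdict(float)
daily_customers = defaultdict(set)
for day, month, cust, amount in events:
    daily_sum[day] += amount
    daily_customers[day].add(cust)

# The sum is additive: the month value is just the sum of the day values.
monthly_sum = sum(daily_sum.values())                                # 45.0

# Distinct count is not additive: adding the daily counts double-counts custA.
wrong_monthly_count = sum(len(s) for s in daily_customers.values())  # 3
right_monthly_count = len(set.union(*daily_customers.values()))      # 2
print(monthly_sum, wrong_monthly_count, right_monthly_count)
```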

It is also possible to define aggregations when the dimensional space is projected into a space with fewer dimensions, e.g. you take location out of consideration and analyze business performance against the other dimensions. Again, it is non-trivial to define and project measures onto such projections.

There is always a large number of projections that can be defined in a multi-dimensional space. If you go from a 6-dimensional to a 4-dimensional space there are C(6,4) = 15 different projections possible. Not all of them may have business relevance, but a business may still have 5-20 relevant dimensional projections, also called aggregate tables of various reductions (in number of dimensions). That is still too complex to maintain, and it defeats the purpose of data warehousing as a data deliverable to end users.

Solving both the dimension projection and the dimensional scale change problems is the invention of data cubes, or multi-dimensional databases. Multidimensional databases store data with measures already calculated (or capable of being calculated automatically) against all dimensional scales and dimensional projections, i.e. hierarchies, and slicing and dicing. They further provide the capability of drilling through: going from a larger to a smaller dimensional scale, or even down to individual business transactions.
It is very cool if you know what you are doing. If not, you can get burnt very fast and not know about it: too many companies are discovering that their MDX programmers are building Data Gyms of unknown accuracy instead of cubes. That will be the subject of a future blog.

Wednesday, August 13, 2008

Ensuring Business Integrity in Data Warehouses

Why do we build data warehouses?
A) To streamline, consolidate and make repeatable the generation and delivery of information.
B) To facilitate speedy, accurate and repeatable exploratory data analysis by non-technical users that still has Business Integrity.
C) To streamline, consolidate and make repeatable statistical data analysis efforts.

In A) I have control over the information and hence can ensure the Business Integrity. In C) I am working with experts in statistical analysis and am just providing them a platform (data & technical) for their analysis.

The troublesome piece is B). This is where the innovation of Ralph Kimball comes in. He developed a structure for data that is simple without sacrificing accuracy. This dimensional model assigns a transaction to one and only one point in multi-dimensional space. The dimensions themselves are modeled after the Cartesian system: orthogonal, with each dimension having a hierarchy of scales defined. A point in multi-dimensional space can have multiple events, and the properties of these events are represented as measures at that point. These measures can be aggregated (simple sum, count, or more complex functions) to define derived measures at each scale. The calculation of measures at a scale from events, or from measures at a lower scale, is non-trivial and needs to be handled carefully.
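A minimal sketch of the two rules just described, with invented data: each transaction is assigned to exactly one point in the space, and the events landing on a point are summarized into measures at that point.

```python
from collections import defaultdict

# Invented transactions; the dimensional coordinates are (date, product, region).
transactions = [
    {"date": "2008-08-13", "product": "IBM",  "region": "NY", "amount": 100.0},
    {"date": "2008-08-13", "product": "IBM",  "region": "NY", "amount": 40.0},
    {"date": "2008-08-13", "product": "MSFT", "region": "SF", "amount": 25.0},
]

cells = defaultdict(list)
for txn in transactions:
    # One and only one point per transaction: a single coordinate tuple.
    point = (txn["date"], txn["product"], txn["region"])
    cells[point].append(txn["amount"])

# Multiple events can share a point; their properties become measures at that point.
measures = {point: {"total": sum(vals), "count": len(vals)}
            for point, vals in cells.items()}
print(measures)
```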
It is the data warehouse architect's goal to fit the data into the scheme described above. While extensions and innovations are always welcome, they must be weighed against the requirement to ensure the Business Integrity of exploratory data analysis without users having to hire a statistician to look over their shoulder, or, in the case of Data Gyms, over the programmer's shoulder.