I am often asked, usually by programmers: what is Data Warehousing and how do I learn it? I explain that we use all the same tools they do, but differently. That's when I coined the term Data Sense. It describes the essence of Data Warehousing and separates Data Warehousing from the rest of programming. Every aspect of IT, from Hardware / Software infrastructure to Design, Development and QA, is done with massive data flows and a need for data precision, accuracy and meaning.

Monday, January 19, 2009

One cube or many

You define multiple cubes only in these cases:
1) Security becomes easier to handle when there are different cubes for different users
2) You can split the business dimensional space into two sub-spaces of lower dimensionality, with a few dimensions overlapping between them
3) You have different sets of measures that are local in different cubes although the business dimensional space is similar. In this case you can even keep the same cube but use different measure groups (see the sketch after this list).
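To make cases 2 and 3 concrete, here is a minimal Python sketch; the cubes, dimensions and measures are hypothetical examples, not taken from any particular project.

```python
# Case 2: split a wide business dimensional space into two cubes of lower
# dimensionality that overlap only on Date and Product (hypothetical names).
sales_cube = {
    "dimensions": ["Date", "Product", "Store", "Customer"],
    "measure_groups": {"Sales": ["Sales Amount", "Units Sold"]},
}
inventory_cube = {
    "dimensions": ["Date", "Product", "Warehouse", "Supplier"],
    "measure_groups": {"Inventory": ["Stock Level", "Reorder Cost"]},
}
print(sorted(set(sales_cube["dimensions"]) & set(inventory_cube["dimensions"])))

# Case 3: the dimensional space is the same, but the measures are local to
# different fact data, so one cube with two measure groups is enough.
combined_cube = {
    "dimensions": ["Date", "Product", "Store"],
    "measure_groups": {
        "Sales": ["Sales Amount", "Units Sold"],
        "Returns": ["Return Amount", "Units Returned"],
    },
}
```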

Vijay

Local Measures in Business Dimensional Space

A few posts down I have described the business dimensional space that forms the basis for Microsoft Cubes; the scheme is repeated below. Steps 0-4 define the space, and in steps 4-6 I have defined how to place the measures in the business dimensional space. In this post I extend the scheme to define and use local measures.

What is a local measure: A local measure is completely defined as a function of the dimensional coordinates of the cell it is contained in, AND this property extends across all hierarchies that the cell is part of. For example, the measure Profit of a two-dimensional cube of Date and Store is local because it is a function of a particular store on a certain day (profit of a store in a day), AND a function of a particular division the store is part of and the month the day is part of (profit of a division in a month).
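Here is a minimal Python sketch of that Date x Store example; the stores, dates, hierarchy mappings and profit figures are all invented. The point is that the value at (division, month) is completely determined by the leaf cells it contains, which is exactly the locality property described above.

```python
from collections import defaultdict

# Hypothetical leaf cells: profit per (day, store).
profit = {
    ("2009-01-05", "Store A"): 100.0,
    ("2009-01-05", "Store B"): 250.0,
    ("2009-01-12", "Store A"): 80.0,
    ("2009-01-12", "Store B"): 150.0,
}

# Hierarchies within each dimension: day -> month, store -> division.
month_of = {"2009-01-05": "2009-01", "2009-01-12": "2009-01"}
division_of = {"Store A": "North", "Store B": "North"}

# Because Profit is local, the (month, division) cell is just a function of
# the coordinates of the leaf cells it contains - here a simple sum.
rollup = defaultdict(float)
for (day, store), value in profit.items():
    rollup[(month_of[day], division_of[store])] += value

print(rollup[("2009-01", "North")])  # 580.0 - profit of a division in a month
```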

Use of non-local measures considerably complicates the retrieval of data from the cube and defeats the purpose of the cube. A similar result can be achieved in that case by writing programs on raw / aggregated data.

Another example: a hedge fund receives its data as delta positions. Storing the delta positions in the cube turns a simple measure, current value, into a non-local measure, which forces the development of an MDX program that browses the entire cube. Current value becomes a trivial local measure if the delta positions are converted into positions before loading into the cube. However, for a measure such as transaction costs, delta positions may be local.
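A minimal Python sketch of that conversion, with invented instruments, dates and numbers: accumulating the deltas per instrument before the load stores a position in every (date, instrument) cell, so current value becomes an ordinary local measure.

```python
from itertools import accumulate

# Hypothetical feed of delta positions per instrument, already in date order.
deltas = {
    "BondX": [("2009-01-05", +100), ("2009-01-06", -30), ("2009-01-07", +50)],
    "EquityY": [("2009-01-05", +200), ("2009-01-07", -75)],
}

# Convert deltas into positions with a running sum per instrument, so the
# fact row loaded into the cube already holds the position on that date.
positions = {}
for instrument, rows in deltas.items():
    dates = [d for d, _ in rows]
    running = accumulate(delta for _, delta in rows)
    positions[instrument] = list(zip(dates, running))

print(positions["BondX"])
# [('2009-01-05', 100), ('2009-01-06', 70), ('2009-01-07', 120)]
```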

This basically defines when you have one cube and when you have many cubes for a particular business problem. That's the subject of the next post.

Scheme for defining Cubes:

0) Define the space you are going to work in, i.e. steps 1-3 below
1) Identify dimensions as entities holding business variables. Ask: does the problem I am trying to measure / model contain this variable as an independent parameter, or is it an artifact of data representation? Lookups and data-processing artifacts of whatever kind are not dimensions. The next question is whether all these variables are independent, or whether they can be grouped into abstract entities, thereby decreasing the dimensionality - keep in mind the world is not 100-dimensional
2) Ensure your dimensions are orthogonal, i.e. there are no relationships between dimensions. The only relationships allowed are hierarchies within dimensions, and relationships between attributes (e.g. code, description)
3) Further abstract your dimensions at the entity level to reduce dimensionality. It is OK if some attributes are empty; you are not trying to save space here.
4) Now you have defined an n-dimensional space with multiple scales for the problem. Place the data points in this space.
5) Define measures that represent the observed data at the dimensional hierarchy level of the data points.
6) Define the relationships between measures at higher levels of the hierarchy and the lower / observed levels. These could be simple additions of the values in each of the dimensional cells, or much more complex formulae (a sketch follows at the end of this post).
Now you have defined and placed the data properly in MS BI Cubes. Remember you still need to verify results against the data points, since error calculation and propagation is not part of the above scheme. You may incorporate it into the above, but it may be non-trivial.
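As an illustration of steps 4-6 and of the verification note above, here is a minimal Python sketch; the dimension members, measures and numbers are invented. Amount and units roll up by simple addition, average price at the higher level is a more complex formula over the rolled-up values, and the observed data points are re-checked at the end.

```python
from collections import defaultdict

# Steps 4-5: observed data points placed in a Date x Store space, with two
# leaf-level measures: sales amount and units sold (hypothetical values).
observed = [
    ("2009-01-05", "Store A", 100.0, 10),
    ("2009-01-05", "Store B", 250.0, 20),
    ("2009-01-12", "Store A", 80.0, 4),
]
month_of = {"2009-01-05": "2009-01", "2009-01-12": "2009-01"}
division_of = {"Store A": "North", "Store B": "North"}

# Step 6: relate higher-level measures to the observed level. Amount and
# units are simple additions; average price is a formula over the rolled-up
# values, not an average of leaf-level averages.
totals = defaultdict(lambda: [0.0, 0])
for day, store, amount, units in observed:
    cell = (month_of[day], division_of[store])
    totals[cell][0] += amount
    totals[cell][1] += units

amount, units = totals[("2009-01", "North")]
print("average price:", amount / units)  # 430.0 / 34

# Verification against the data points: the totals must reconcile with the
# observed rows (error calculation and propagation would need more than this).
assert amount == sum(a for _, _, a, _ in observed)
assert units == sum(u for _, _, _, u in observed)
```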