I am often asked usually by programmers - What is Data Warehousing & how do I learn it? I explain to them we use all the same tools that you do but differently. That’s when I coined the term Data Sense. It describes the essence of Data Warehousing and separates Data Warehousing from rest of Programming. Every aspect of IT from Hardware / Software infrastructure to Design, Development and QA is done with massive data flows and need for data precession accuracy and meaning.

Monday, January 28, 2008

Applications Vs ASPs


Friday, March 23, 2007

What to look for when you hire a data warehouse engineer

  • Full lifecycle experience in Business analysis, design, development and operations of terabyte size data warehouses in star schema. Must have worked on atleast 3 seperate data warehouses.
  • Experience in ETL architecture design and development using industry standard ETL tools (any of ab initio, Informatica, Data Stage or SSIS)Experience in Relational & Dimensional Data Modeling.
  • Ability to reverse engineer and understand existing data models.Familiarity with Database, Software and hardware issues / architecture needed to design and develop a terabyte size data warehouse (On UNIX AND Windows Platforms, ORACLE and SQL SERVER)
  • Understanding of data accuracy and quality and the ability to backtrack and solve data issues
  • Nice to have:Experience in Cube design and development using SSAS, Hyperion or any other industry standard tools.

Sunday, March 18, 2007

Why do Most data warehouses Fail?

  1. Datawarehouse is treated like any other application in the company and same people put it through the same process that worked for other applications.
  2. Lack of understanding of the mess that the application sources to data warehouses are in and an optimistic expectation of data quality.
  3. Lack of understanding of the columes of data involved and the false expectations that the vendors set. Yes SSAS will work for this volume we were told. But in reality a vastly toned down version with 10 % of requirements will work on a 64 cpu box we found after months of work.
  4. .....

Why are programmers creative in creating data subtilities

Every time we touch a dataset we are told "in this case x means y else x means z only if x = 999 or else z means -z if.....". Or in plain english a pink cat is a dog unless its not alive (as indicated by its age) then its a toy and its age is -999.

I was told by one progammer the reason for such convolutions in data and he does have his point. "I write the smallest, easiest to maintain most efficient to run code..... I am not writing extra code to make sure the data is clean and meaningful" (I paraphrased).

Thursday, March 1, 2007

Basics of Dimensional Modeling

A data warehouse answers a particuler set of business questions. The questions will have variables like sales, profit, location. If you think deep you will recognise that there are two types of variables - those that represent or measure the business and those that are parameters that influence thew business. The former are facts and the latter are dimensions. Or to be more clear the former are attributes of Fact Entity (s) and the latter are Dimension Entities. - NOTE the difference in the defnition. This is crucial to arrive at a simple dimensional data warehouse data model. The data model is the heart of the data warehouse - NOT the ETL.

Wednesday, February 21, 2007

BI in its own network domain?

A very significant piece of effort into BI initiatives go into putting up proper infrastructure in place to handle massive flows of data. Further the security and other requirements are different. It is not uncommon to see BI Hardware being shoe horned into current Network and Hardware setup, with CORP, DeV, QA and Prod thus forcing BI to deal with the security and other policies that are alien to it. e.g.: Any tool installation requires corp approval, handling and management. They don't have the skillset to install, maintain or manage the tool.

A way out is to create a BI domain and place a firewall around it but let policies inside the domain to be more lax.

Monday, February 19, 2007

How many dimensions should I have in my Data Warehouse?

I can tell how many you should not have - 25, 50 ,100 or more. Because I refuse to believe there are 100 significant entities by which you are analyzing your business. To understand this lets start off with the definition of a dimension.

A dimension is an ENTITY by which you analyze your business. Even 6 months back I used to believe a dimension is a Variable by which you analyze your business. Then I happened to work on projects with more than 100 dimensions tables. I realized the scope for misunderstanding in the definition.

What is an Entity and what is a variable or an attribute? A customer is an entity and a customer name is an attribute. Customer state, Customer Zip, Customer Type are some more attributes of same entity. The Dimension is Customer. You'll have 100 dimensions in your data warehouse only if you create a dimension for each attribute - e.g. Dim_Customer_Type. Or Dim_Tick_Type. Take your 100 + dimensions and make entities out of them. You'll end up with 10 -15 entities or less. The name of the game is to get to right level of abstraction.

When you have 100 + dimensions it becomes impossible to develop build and manage the system. There is no ROI. Your resources don't increase just because you need them. YOur resources are what you can afford. Insted of beating your resources to the ground whether people or machines, proper design and a simple data warehouse that satisfies the requirements should be built.