I am often asked usually by programmers - What is Data Warehousing & how do I learn it? I explain to them we use all the same tools that you do but differently. That’s when I coined the term Data Sense. It describes the essence of Data Warehousing and separates Data Warehousing from rest of Programming. Every aspect of IT from Hardware / Software infrastructure to Design, Development and QA is done with massive data flows and need for data precession accuracy and meaning.

Sunday, February 18, 2007

The Data warehouse at Web MD

Every data warehouse is a different and learning experience. Working with vendor supplied application (SAP) or stable in house application packages as data sources different experience from working with several short and long term constantly changing data sources. Working with WebMD has expanded the domain of data warehousing for me.

Data warehousing is usually described in terms of Data sources, the extraction of data from those data sources, their transformation and integration into a logically consistent data set and further the load into the database and hence the merge with already existing data. Easy access is provided to this huge data set, something operational systems cannot do at this scale. Data warehouse is the Data, the Meta data, the software, the hardware, the processes, the procedures, the operations and policies that enable this.

WebMD is a Heath care information website with advertising and sponsor based revenue model. The information they are looking for from the data warehouse is about the user behavior on site. e.g.: top pages viewed, top pages viewed after a certain page. They look both at aggregates and in very specific events. The primary data sources are the web logs. The secondary ones are several data based that collect data based on web forms. e.g.: registration. The web logs are physically one data source. The format is one but the data structure is not. Logically web logs can be looked at as common format interface to several disparate data sources. Each of the fields of web log is just a collection of several different other fields e.g. a query string may contain several dozen fields passed to us as name value pairs. These logical data sources are constantly changing as the web site is ever evolving with added functionality or changing business partnerships. The set of use cases of a Point of Sale data source is known. The set of use cases of a web site is unknown and changing. The generalized analysis requirements of the former can be known. For the latter its impossible. Our data warehouse is built to accommodate these fast changes. It is very flexible while retaining the ease, scale and access of the other data warehouses I worked on.

Our extraction process consisted of several data movement scripts, a c# parser which reformats the input data source and does some basic transformations, A number of sql based stored procedures performing aggregations and transformations managed by SQL Server’s DTS (It is not strictly an ETL tool which automates code development. Rather it provides a management shell where you write your own code). Loading is managed by DTS. The database is SQL server which is not common for a 2 tera byte data warehouse and that restricts us to me smart in our data model and the way we manage the data bases that make up the data warehouse. Data access is provided by Business Objects tool. Business objects also provides a layer of insulation for analysis from the data model. We can change the data model without affecting the validity of the hundreds of analyses on it.

Over time our role and understanding of the data warehouse has expanded to include in addition to requirement analysis 1) to understand the business problem being solved at the application requirements level, to make sure that the application developers are developing applications and recording data in a way that further down the line the data is accurate and logically consistent at the data warehouse 2) to understand the end user analysis and analytics and to make sure that their analysis is consistent when our data changes. The first role is that of an enterprise data architect and the second of a data liaison with the Analytics and user community.
I have worked in several full life cycle data warehouses before from conception of data warehouse to its actual use in production. This data warehouse redefines the concept to include data from conception to use. Being part of a small team of four, and the senior most member of the group I have worked on almost every aspect of the data warehouse hands on, either in coding or in helping others code or in checking their code