I am often asked, usually by programmers: what is Data Warehousing and how do I learn it? I explain that we use all the same tools they do, but differently. That is when I coined the term Data Sense. It describes the essence of Data Warehousing and separates it from the rest of programming. Every aspect of IT, from hardware / software infrastructure to design, development and QA, is done with massive data flows and a need for data precision, accuracy and meaning.

Wednesday, September 17, 2008

Salient Features of Data Modeling Techniques

Relational: Normalization, data-application independence. One type of relationship between entities.

Object Model: Extensibility, reuse, encapsulation. Conditional, inheritance, association and other kinds of relationships between objects.

Dimensional: Precision, understandability, division of variables into measures / facts (measures of the business) and dimensions (parameters of business performance). One-to-many relationships between dimensions and facts. No relationships between dimensions. Hierarchies within dimensions (see the sketch after this list).

ORM: Mapping objects onto relational space. Object relationships are built and maintained in the application layer.

ODM: Mapping objects onto dimensional space. Multiple dimensional objects per dimension and multiple fact objects are all related to each other, with the relationships implemented in the reporting layer.
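To make the dimensional entry above concrete, here is a minimal star-schema sketch in SQL. All table and column names (D_Date, D_Product, F_Sales and so on) are hypothetical; the point is the one-to-many relationships from each dimension to the fact, the absence of relationships between dimensions, and the hierarchy (day, month, year) held inside a single dimension.

-- Hypothetical dimensions: each carries its own hierarchy; there are no links between dimensions.
CREATE TABLE D_Date (
    date_key     INT PRIMARY KEY,
    calendar_day DATE,
    month_name   VARCHAR(20),   -- the day -> month -> year hierarchy lives inside the dimension
    year_number  INT
);

CREATE TABLE D_Product (
    product_key   INT PRIMARY KEY,
    product_code  VARCHAR(20),
    product_desc  VARCHAR(100), -- code / description relationship stays inside the dimension
    product_group VARCHAR(50)
);

-- Fact table: one row per business event; one-to-many from each dimension to the fact.
CREATE TABLE F_Sales (
    date_key    INT NOT NULL REFERENCES D_Date (date_key),
    product_key INT NOT NULL REFERENCES D_Product (product_key),
    quantity    INT,            -- measures of the business
    sale_amount DECIMAL(18,2)
);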

Wednesday, September 3, 2008

My Attempt at creating a hundred dimension data warehouse (Sarcastic)

To actually build a hundred-plus-dimension data warehouse, I made myself think like an OO programmer. I usually don't work like this; I intend this to prove to my audience that bad design choices lead to complicated data warehouses. Again, my refrain is: if you want to write programs, use Java and programming constructs. Don't mix data warehousing and programming paradigms. If you do, you will end up not with a Dimensional Model but with ODM – an Object Dimensional Model.
Vision: Create a data warehouse to store FIX data for analysis.
Data Sources: FIX message Logs. Reference data stored in applications.
Business Analysis: Based on an understanding of the FIX protocol. There are messages, components and fields. Fields are used in messages, in components or independently. There are 100+ messages. Taking a cursory look, removing all messages that do not look important, and making a fact table per message, I end up with 64 fact tables. (Wow! Almost the same number as in a messed-up data warehouse at a major stock exchange.) The list is at the end of this post.
Finding dimensions is a little bit more complex. There are nearly 1800 fields and 163 components. The trick is to find the fields and components used by the 64 fact tables below.
F_ResendRequest
F_Reject
F_SequenceReset
F_Logout
F_IOI
F_Advertisement
F_ExecutionReport
F_OrderCancelReject
F_Logon
F_News
F_Email
F_NewOrderSingle
F_NewOrderList
F_OrderCancelRequest
F_OrderCancelReplaceRequest
F_OrderStatusRequest
F_AllocationInstruction
F_ListCancelRequest
F_ListExecute
F_ListStatusRequest
F_ListStatus
F_AllocationInstructionAck
F_DontKnowTradeDK
F_QuoteRequest
F_Quote
F_SettlementInstructions
F_MarketDataRequest
F_MarketDataSnapshotFullRefresh
F_MarketDataIncrementalRefresh
F_MarketDataRequestReject
F_QuoteCancel
F_QuoteStatusRequest
F_MassQuoteAcknowledgement
F_SecurityDefinitionRequest
F_SecurityDefinition
F_SecurityStatusRequest
F_SecurityStatus
F_TradingSessionStatusRequest
F_TradingSessionStatus
F_MassQuote
F_BusinessMessageReject
F_BidRequest
F_BidResponse
F_ListStrikePrice
F_XML_non_FIX
F_RegistrationInstructions
F_RegistrationInstructionsResponse
F_OrderMassCancelRequest
F_OrderMassCancelReport
F_NewOrderCross
F_CrossOrderCancelReplaceRequest
F_CrossOrderCancelRequest
F_SecurityTypeRequest
F_SecurityTypes
F_SecurityListRequest
F_SecurityList
F_DerivativeSecurityListRequest
F_DerivativeSecurityList
F_NewOrderMultileg
F_MultilegOrderCancelReplace
F_TradeCaptureReportRequest
F_TradeCaptureReport
F_OrderMassStatusRequest
F_QuoteRequestReject
F_RFQRequest
F_QuoteStatusReport
F_QuoteResponse
F_Confirmation
F_PositionMaintenanceRequest
F_PositionMaintenanceReport
F_RequestForPositions
F_RequestForPositionsAck
F_PositionReport
F_TradeCaptureReportRequestAck
F_TradeCaptureReportAck
F_AllocationReport
F_AllocationReportAck
F_Confirmation_Ack
F_SettlementInstructionRequest
F_AssignmentReport
F_CollateralRequest
F_CollateralAssignment
F_CollateralResponse
F_CollateralReport
F_CollateralInquiry
F_NetworkCounterpartySystemStatusRequest
F_NetworkCounterpartySystemStatusResponse
F_UserRequest
F_UserResponse
F_CollateralInquiryAck
F_ConfirmationRequest
F_ContraryIntentionReport
F_SecurityDefinitionUpdateReport
F_SecurityListUpdateReport
F_AdjustedPositionReport
F_AllocationInstructionAlert
F_ExecutionAcknowledgement
F_TradingSessionList
F_TradingSessionListRequest

I see a pattern in the fields. For some fields there is an ID and a Type. Browsing through them, these look like the fields I'd like to report by, so I make them dimensions. There are 163 of them. This looks eerily like the data warehouse I saw at a major stock exchange. Here is the dimension list.

D_AccountType
D_AdjustmentType
D_AdvTransType
D_AllocAccountType
D_AllocIntermedReqType
D_AllocLinkType
D_AllocNoOrdersType
D_AllocReportType
D_AllocSettlInstType
D_AllocTransType
D_AllocType
D_ApplReportType
D_ApplReqType
D_ApplResponseType
D_BasisPxType
D_BenchmarkPriceType
D_BidDescriptorType
D_BidRequestTransType
D_BidTradeType
D_BidType
D_BookingType
D_CPRegType
D_CollApplType
D_CollAsgnRespType
D_CollAsgnTransType
D_CommType
D_ConfirmTransType
D_ConfirmType
D_ContAmtType
D_ContingencyType
D_CrossType
D_CxlType
D_DeliveryType
D_DerivativeEventType
D_DerivativeInstrAttribType
D_DerivativeInstrumentPartySubIDType
D_DerivativeSecurityListRequestType
D_DerivativeSecuritySubType
D_DerivativeSecurityType
D_DeskType
D_DiscretionLimitType
D_DiscretionMoveType
D_DiscretionOffsetType
D_DlvyInstType
D_EmailType
D_EventType
D_ExecPriceType
D_ExecTransType
D_ExecType
D_ExpirationQtyType
D_IOITransType
D_IndividualAllocType
D_InstrAttribType
D_InstrumentPartySubIDType
D_LegBenchmarkPriceType
D_LegPriceType
D_LegRepoCollateralSecurityType
D_LegSecuritySubType
D_LegSecurityType
D_LegSettlType
D_LegStipulationType
D_LegSwapType
D_LiquidityIndType
D_ListExecInstType
D_ListStatusType
D_LotType
D_MDBookType
D_MDEntryType
D_MDFeedType
D_MDOriginType
D_MDQuoteType
D_MDSecSizeType
D_MDSubBookType
D_MDUpdateType
D_MassActionType
D_MassCancelRequestType
D_MassStatusReqType
D_MatchType
D_MiscFeeType
D_MsgType
D_MultiLegReportingType
D_Nested2PartySubIDType
D_Nested3PartySubIDType
D_Nested4PartySubIDType
D_NestedInstrAttribType
D_NestedPartySubIDType
D_NetworkRequestType
D_NetworkStatusResponseType
D_OrdType
D_OwnerType
D_OwnershipType
D_PartySubIDType
D_PegLimitType
D_PegMoveType
D_PegOffsetType
D_PegPriceType
D_PosAmtType
D_PosReqType
D_PosTransType
D_PosType
D_PriceLimitType
D_PriceType
D_QtyType
D_QuantityType
D_QuoteCancelType
D_QuotePriceType
D_QuoteRequestType
D_QuoteRespType
D_QuoteType
D_RefMsgType
D_RegistAcctType
D_RegistTransType
D_RepoCollateralSecurityType
D_RespondentType
D_ResponseTransportType
D_RootPartySubIDType
D_RoutingType
D_SecondaryPriceLimitType
D_SecondaryTrdType
D_SecurityListRequestType
D_SecurityRequestType
D_SecurityResponseType
D_SecuritySubType
D_SecurityType
D_SettlDeliveryType
D_SettlInstTransType
D_SettlObligTransType
D_SettlPartySubIDType
D_SettlPriceType
D_SettlType
D_SideMultiLegReportingType
D_SideTrdRegTimestampType
D_StandInstDbType
D_StatsType
D_StipulationType
D_StrategyParameterType
D_SubscriptionRequestType
D_TaxAdvantageType
D_TerminationType
D_TickRuleType
D_TradeReportTransType
D_TradeReportType
D_TradeRequestType
D_TrdRegTimestampType
D_TrdSubType
D_TrdType
D_TriggerOrderType
D_TriggerPriceType
D_TriggerType
D_UnderlyingCPRegType
D_UnderlyingCashType
D_UnderlyingLegSecuritySubType
D_UnderlyingLegSecurityType
D_UnderlyingRepoCollateralSecurityType
D_UnderlyingSecuritySubType
D_UnderlyingSecurityType
D_UnderlyingSettlPriceType
D_UnderlyingSettlementType
D_UnderlyingStipType
D_UndlyInstrumentPartySubIDType
D_UserRequestType
D_YieldRedemptionPriceType
D_YieldType
There you have it – a badly designed monster of 150-odd dimensions and 60-odd fact tables. No doubt you will have to write huge programs to analyze this data.
Doing it the right way is much harder, will take much longer and will need an understanding of the business. I need to get paid to do that, but I leave you with a final thought: I have reduced a similar monster of 150+ dimensions and 50+ fact tables at a major stock exchange to a hypothetical model with fewer than 15 dimensions and 3 fact tables. That should be an indicator of where you need to be.
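For a feel of the direction such a reduction could take, here is a purely hypothetical sketch: instead of a fact table per FIX message type and a dimension per *Type field, the message type itself becomes a small dimension, the various *Type fields become attributes, and one fact table holds the order-lifecycle events. Every name below is illustrative, not a finished design; getting the real dimensions right requires the business understanding mentioned above.

-- Hypothetical consolidation: the message-level fact tables collapse into a few
-- business-event fact tables, and the many *Type fields become attributes or one
-- small dimension instead of 163 separate dimensions.
CREATE TABLE D_MsgType (
    msg_type_key INT PRIMARY KEY,
    fix_msg_type CHAR(2),       -- e.g. 'D' = NewOrderSingle, '8' = ExecutionReport
    msg_name     VARCHAR(60),
    msg_category VARCHAR(30)    -- order flow, market data, allocation, ...
);

CREATE TABLE D_Security (
    security_key      INT PRIMARY KEY,
    symbol            VARCHAR(20),
    security_type     VARCHAR(20),   -- SecurityType becomes an attribute, not a dimension
    security_sub_type VARCHAR(20)
);

-- One fact table for order-lifecycle events instead of one per message type.
CREATE TABLE F_OrderEvent (
    event_date_key INT NOT NULL,
    msg_type_key   INT NOT NULL REFERENCES D_MsgType (msg_type_key),
    security_key   INT NOT NULL REFERENCES D_Security (security_key),
    order_qty      DECIMAL(18,4),
    price          DECIMAL(18,6)
);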

Saturday, August 30, 2008

How and why are hundred-dimension data warehouses designed? Is it the birth of Object Dimensional Modeling?

As a follow-up to my previous posts on the need for, and the process of designing, a simple dimensional model that can satisfy the analysis requirements, I have investigated how and why 100-dimension monsters are built. I acknowledge conversations with colleagues on this subject.

Why:
1) Sponsors of the data warehouse want every single attribute from all data sources to be stored in the data warehouse. This is the big-bang approach - build once for all future. Users also don't want to be told a particular attribute is not present in the data warehouse after it is built, so the safest approach is to store everything.

2) Users are used to the data elements they worked with during the business processes. Abstraction into new variables will take away that familiarity, and there is reluctance.

3) Designers were never exposed to variable reduction through abstraction and hence don't consider it an option. Designers have an improper understanding of what a dimension is - they slap "dim" in front of any source table name and assume it is a dimension of the data warehouse. Further, they perform horizontal partitioning of data (very OO) for performance reasons.

How:

1) Source databases' data models are copied into one structure. Each source (or a small set of sources) becomes a subject area.

2) Tables that completely (or very nearly) overlap between data sources are merged.

3) The purpose of each table is investigated. If it is a transaction, it becomes a fact table (2 or more tables can be merged). If it is a reference / lookup / business-process or other non-transaction table, it becomes a dimension. Conformed dimensions are set up between subject areas.

4) Other tables / data are merged into the identified fact / dimension tables. SCD properties are identified and metadata added. Load-process metadata tables / attributes are added.

This process reminds me of the creation of a collage, and the end effect is very much the same: a very fragmented but tightly knit data structure.

As my colleagues mentioned, this is the bottom-up approach. My interpretation of the bottom-up approach is different, and I'll make it the subject of a future post.

My Comments:

1) I look at this as a data / database consolidation project, not a data warehousing project.

2) I am not sure there is any advantage in moving to dimensional modeling in such cases. Why not use 3NF where you need to build a complete data repository for the enterprise? I am not sure a 100-dimension data web is more understandable, accurate or faster than a 3NF model. In fact, it is the opposite.

3) Some of the techniques smell similar to ORM. And given that most sources are already in object models, are we looking at the birth of ODM - Object Dimensional Mapping?

4) Data access software and tools should take the above structures into account. A significant amount of code needs to be written to analyze data in the above scenario. This looks very much like a stovepipe analysis scenario, and doing stovepipe analysis and maintaining the code in MDX, for example, is non-trivial.

Vijay

Wednesday, August 20, 2008

Microsoft BI and CubeFilesystem

This is my real experience analyzing several Microsoft BI implementations – especially in Financial Institutions.

Programmers are used to the file system as a place where they store and retrieve data in raw form, with significant logic in code (C++, Java, etc.) on top of it to get the required functionality. When given a new toy, MDX, they used it exactly the same way – as a filesystem to build the logic on, instead of a business-dimensional space to build the logic in! What they should be concentrating on is the structure of the dimensional space, its various measurements, the definition of scales, the logic embedded in moving measures from scale to scale, and the definition of measures when all you have is data points.

Not that it's wrong. But why do you need to write a program in MDX? You may as well use Java and aggregate data structures landed to disk, if that is how you want to proceed.


Aug 21: Update

The last few posts have led to some emails from friends asking: if what they are doing isn't an apt use of the tool, what then is the proper use of the tool? I referred them to my previous posts but promised to summarize them here.
0) Define the space you are going to work in, i.e. steps 1-3 below.
1) Identify dimensions as entities holding business variables. Ask: does the problem I am looking to measure / model contain this variable as an independent parameter - not, is this an artifact of data representation? So lookups and data-processing artifacts of whatever kind are not dimensions. The next question is whether all these variables are independent or whether they can be grouped into abstract entities, thereby decreasing the dimensionality - keep in mind the world is not 100-dimensional.
2) Ensure your dimensions are orthogonal - i.e. there are no relationships between dimensions. The only relationships allowed are hierarchies within dimensions and relationships between attributes (e.g. code, description).
3) Further abstract your dimensions at the entity level to reduce dimensionality. It is OK if some attributes are empty - you are not trying to save space here.
4) Now you have defined an n-dimensional space with multiple scales for the problem. Place the data points in the space.
5) Define measures that represent the observed data at the dimensional hierarchy level of the data points.
6) Define the relationships between measures at higher levels of the hierarchy and at the lower / observed levels. These could be simple addition of the values in each dimensional cell or much more complex formulae.
Now you have defined and placed the data properly in MS BI cubes. Remember you still need to verify results against data points, since error calculation and propagation are not part of the above scheme. You may incorporate them into the above, but it may be non-trivial. A minimal sketch of steps 1-6 follows.
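This is only a sketch, with invented names, of what steps 1-6 look like in SQL terms: a dimension carrying its hierarchy, a fact table holding the observed measure at the base grain, and an aggregation that moves the measure up one level of the hierarchy. In SSAS the hierarchy and aggregation rules would be declared on the cube rather than written as a view.

-- Steps 1-3: a dimension holding a business variable, with its hierarchy inside it.
CREATE TABLE D_Date (
    date_key     INT PRIMARY KEY,
    calendar_day DATE,
    month_number INT,
    year_number  INT
);

-- Steps 4-5: observed data points placed in the space, with measures at the base grain.
CREATE TABLE F_Trade (
    date_key     INT NOT NULL REFERENCES D_Date (date_key),
    security_key INT NOT NULL,
    trade_qty    DECIMAL(18,4),
    trade_value  DECIMAL(18,2)
);

-- Step 6: the relationship between measures at a higher hierarchy level and the
-- observed level. Here it is a simple sum; in general it can be far more complex.
CREATE VIEW V_Trade_ByMonth AS
SELECT d.year_number,
       d.month_number,
       SUM(f.trade_qty)   AS total_qty,
       SUM(f.trade_value) AS total_value
FROM   F_Trade f
JOIN   D_Date  d ON d.date_key = f.date_key
GROUP BY d.year_number, d.month_number;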
Enjoy

Tuesday, August 19, 2008

Where the mind is free......

Sometimes an intractable problem seems so only because it is misclassified. When you take a step back, look at the big picture and think, it is no longer an issue.
A case in point is a data problem that some hedge fund companies mentioned - how do you define a slowly changing dimension to take care of the issue where there are multiple identification schemes for the same security and each transaction uses one or more of these schemes? These schemes have non-trivial many-to-many relationships with each other. This is very much a back-office problem.
But for market analysis, does it matter? Would storing it as a junk dimension be OK? Can we run association algorithms on it to group records with different sets of security identifiers into one record, since at some level they represent the same security? Or, in other words, would the information contained in the grouped security identifiers be enough for analysis? Hmmm - once we think about it, it becomes a simple junk dimension + data mining problem (sketched below).
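A hypothetical sketch of that junk dimension: one row per observed combination of identifiers, with a group key assigned by the association / matching step so that different identifier combinations for the same underlying security roll up together. The names are illustrative only.

-- Junk dimension: one row per observed combination of security identifiers.
CREATE TABLE D_SecurityIdJunk (
    security_id_key    INT PRIMARY KEY,
    cusip              VARCHAR(9),
    isin               VARCHAR(12),
    sedol              VARCHAR(7),
    ticker             VARCHAR(20),
    -- Assigned by the grouping / data mining step: identifier combinations judged
    -- to represent the same underlying security share the same group key.
    security_group_key INT
);

-- Analysis then groups by security_group_key instead of untangling the
-- many-to-many identifier relationships transaction by transaction.
SELECT j.security_group_key,
       SUM(f.trade_value) AS total_value
FROM   F_Trade f
JOIN   D_SecurityIdJunk j ON j.security_id_key = f.security_id_key
GROUP BY j.security_group_key;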
Where the mind is free........

Friday, August 15, 2008

The Hedge Fund P&L Problem

I have talked to a few hedge funds over the last couple of months, and this is a common problem they face when implementing Microsoft BI solutions without a deep understanding of dimensional modeling. They are looking for MDX experts to help them dig out of the hole, not recognizing that it's a design issue.
Here is the simplified version of the problem. A hedge fund owns m(n) quantities of n securities. Every day they can sell or buy a few more – both the quantity of a security held (e.g. 100 of IBM) and the number of different securities (IBM, MSFT, HLTH…) may change. The values of the securities themselves may change (let's simplify by assuming mark to market). They get a feed of the daily transactions that were closed. They need to make a daily (or periodic) Profit and Loss estimate. This could influence market communications and security transaction decisions. I assume accounting may make estimates of those, but you may be looking to build an SSAS cube in addition, to study P&L by various dimensions to improve your business, or to build an executive dashboard.
Please read the last two posts to understand the underpinnings of dimensional modeling. I'll build on those to explain how we model the above problem. First, let me explain how it was modeled wrongly in several places I am aware of.
Take the incoming purchases / sales as transactions / facts. Identify dimensions (mistakes were made here too, but that's beside the point). Build an SSAS cube out of those transactions. Now start writing MDX programs that deduce P&L from those transactions. It's not that they didn't understand the problem - they understand how to calculate P&L from the data in hand. But they are using cubes as a filesystem and writing MDX programs on it. If you want to do that, use Java; why struggle with MDX? It's not that the business is complex. It's just that the design is wrong: the measures you are storing at the base level of the cube are wrong.
What does your car speedometer show you? The speed of the car at that instant - not acceleration, not velocity, just the speed. Imagine if it showed acceleration and you constantly had to calculate the speed in the back of your mind by adding to (or subtracting from) an initial speed of 0.
So what should the right measures be in the hedge fund P&L business-dimensional space? That depends on the understanding of the problem. Don't confuse sales transactions with business events in business-dimensional space. What we are trying to analyze in P&L is the holdings – not the sales. Sales are just changes in the holdings, i.e. a derived measure / calculation in the cube. Moving the measures to reflect holdings, not sales, will solve the complexity issues with the P&L cubes. You no longer have to traverse several cells in sales space to deduce P&L; P&L is a simple calculation from the difference in measures at cell A and cell B. This will not increase the size of the cube - the size of the cube is dictated by the cardinality of the dimensions. It's just the ETL that needs to reflect the logic of deducing holdings from initial holdings and sales.
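Here is a hypothetical sketch of that ETL step: derive daily holdings from the cumulative effect of trades, store the holdings as the fact, and P&L becomes a difference in market value between two cells. The table names (F_Trade, F_DailyPrice, F_DailyHolding) are invented and the window-function syntax assumes a reasonably recent SQL engine; the opening position and the carry-forward across non-trading days are omitted for brevity.

-- Daily holdings = cumulative net traded quantity to date, marked to market.
WITH daily_net AS (
    SELECT security_key,
           date_key,                      -- assumed to sort in calendar order, e.g. YYYYMMDD
           SUM(quantity) AS net_qty       -- buys positive, sells negative
    FROM   F_Trade
    GROUP BY security_key, date_key
),
running_position AS (
    SELECT security_key,
           date_key,
           SUM(net_qty) OVER (PARTITION BY security_key
                              ORDER BY date_key
                              ROWS UNBOUNDED PRECEDING) AS quantity_held
    FROM   daily_net
)
INSERT INTO F_DailyHolding (date_key, security_key, quantity_held, market_value)
SELECT r.date_key,
       r.security_key,
       r.quantity_held,
       r.quantity_held * p.close_price    -- mark to market
FROM   running_position r
JOIN   F_DailyPrice p
  ON   p.security_key = r.security_key
 AND   p.date_key     = r.date_key;

-- P&L between two dates is then the difference of market_value at two cells,
-- not a traversal of every sale in between.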

And since I am often asked - no, there is no dimension that can fix the above problem. A particular facet may be fixed, but a hack is a hack and the problem shows up elsewhere – in needing complex MDX, or in unknown data accuracy.

Thursday, August 14, 2008

Facts, measures, aggregations and cubes …….

In my last blog post I quickly skipped from facts to measures to aggregations. Let me make this technique clearer.

A business transaction can be represented by a point in business-dimensional space. The point has properties assigned to it based on the business transaction. In BI a single point is not of much use; you are looking more at the trend, as an indicator of what has happened, to forecast what will happen and plan for it. One method of finding trends from individual events with properties in multi-dimensional space is to model the business with parameters and fit the model to the data to obtain the parameters. An alternative way is to define statistical quantities over small units of the dimensions (instead of points). These statistical quantities are measures. They can be as simple as a sum or a distinct count, or much more complex, like EBIDA. Going from individual events to statistical quantities reduces not only the size of the data but also its complexity – the trend of several hundred stock prices is encapsulated in one DJIA number that signifies the entire market.

Now, the units of a dimension are not universal or fixed. In fact, several nested units can be defined for the same dimension. At different scales you may notice different trends in the business as discovered in the data: the NASDAQ may be trending down while the NASDAQ 100 is trending up, or a commodities index may be trending up while the DJIA is crashing. However, it is non-trivial to define and compute statistical quantities from one unit of a dimension to another – in other words, to move measures from one level of a dimension hierarchy to another. Such techniques are called aggregations.
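To make the non-triviality concrete, here is a small illustration in SQL (table and column names are hypothetical) of moving a measure from the day scale to the month scale: an additive measure like traded quantity simply sums, but a distinct count cannot be summed and has to be recomputed from the detail at the new scale.

-- An additive measure rolls up from day to month by simple summation.
SELECT d.year_number, d.month_number,
       SUM(f.trade_qty) AS monthly_volume
FROM   F_Trade f
JOIN   D_Date  d ON d.date_key = f.date_key
GROUP BY d.year_number, d.month_number;

-- A distinct count does not: summing daily distinct traders double-counts anyone
-- active on more than one day, so the measure must be recomputed at the month scale.
SELECT d.year_number, d.month_number,
       COUNT(DISTINCT f.trader_key) AS monthly_active_traders
FROM   F_Trade f
JOIN   D_Date  d ON d.date_key = f.date_key
GROUP BY d.year_number, d.month_number;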

It is also possible to define aggregations when the dimensional space is projected into a space with fewer dimensions – e.g. you take location out of consideration and analyze business performance against the other dimensions. Again, it is non-trivial to define and project measures onto such projections.

There is always a large number of projections that can be defined in a multi-dimensional space. If you go from a 6-dimensional to a 4-dimensional space, there are 15 different projections possible (the number of ways to choose 4 dimensions out of 6). Not all of them may have business relevance, but a business may still have 5-20 relevant dimensional projections – also called aggregate tables of various reductions (in the number of dimensions). Still, this is too complex to maintain and defeats the purpose of data warehousing as a data deliverable to end users.

Solving both the dimension-projection and the dimensional-scale-change problems is the invention of data cubes, or multidimensional databases. Multidimensional databases store data with measures already calculated (or capable of being calculated automatically) against all dimensional scales and dimensional projections – i.e. hierarchies and slicing and dicing. They further provide the capability of drilling through – going from a larger to a smaller dimensional scale, or even to individual business transactions.
It's very cool if you know what you are doing. If not, you can get burnt very fast and not know about it – too many companies are discovering that their MDX programmers are building Data Gyms of unknown accuracy instead of cubes. That will be the subject of a future post.

Wednesday, August 13, 2008

Ensuring Business Integrity in Data Warehouses

Why do we build data warehouses?
A) To streamline, consolidate and make repeatable information generation and delivery.
B) To facilitate speedy, accurate and repeatable exploratory data analysis by non-technical users that still has Business Integrity.
C) To streamline, consolidate and make repeatable statistical data analysis efforts

In A) I have control over the information and hence can ensure Business Integrity. In C) I am working with experts in statistical analysis and am just providing them with a platform (data and technical) for their analysis.

The troublesome piece is B). This is where the innovation of Ralph Kimball comes in: he has developed a structure for data that is simple without sacrificing accuracy. This dimensional model assigns a transaction to one and only one point in multi-dimensional space. The dimensions themselves are modeled after the Cartesian system – orthogonal, with each dimension having a hierarchy of scales defined. A point in multi-dimensional space can have multiple events, whose properties are represented as measures at that point. These measures can be aggregated (simple sum, count, or more complex functions) to define derived measures at each scale. The calculation of measures at a scale from events or from measures at a lower scale is non-trivial and needs to be handled carefully.
It is the data warehouse architect's goal to fit the data into the scheme described above. While extensions and innovations are always welcome, they must be weighed against the requirement to ensure the Business Integrity of exploratory data analysis without users having to hire a statistician to look over their shoulder – or, in the case of Data Gyms, over the programmer's shoulder.

Friday, July 18, 2008

Of Data Warehouses and Data Gyms

Data Warehouse: Organizational data is collected, integrated, understood and structured for analysis. Data access is very simple and results are repeatable.
Data Gym: Organizational data is collected and exposed to users through some heavy-duty power lifting of the sort you find in a gym. Most of the logic is developed in a programmatic layer at the data-access level. SQL / MDX, instead of being used for data access, is used for complex logic that puts serious stress on the machines and leads to performance issues, which are then addressed through typical programmatic techniques.

Friday, May 16, 2008

Data Warehousing the Cloud.

Cloud computing is the in thing today. Cloud computing is what its name says – being able to build functional capability without overall structure, or with the overall structure handled by someone else (black-box computing may be apt in this case). This is different from outsourcing hardware to a data center. It is an approach to providing business functionality that aims to be faster and cheaper.

Properly structured enterprise application development needs use-case analysis of the business, design of a relational data model (the solution to the legacy data-cloud problem proposed by Codd), creation of an object layer on top of the relational model, and structured screens that lead the user through the data entry / selection process. This structured approach ensures data integrity and functional integrity, and hence business integrity.

However, it is also expensive – not just financially. To understand this, look at what happened when the structured GUI became the screen cloud (the Web). Innovation and usage skyrocketed while cost dipped sharply when screens flowing into one another became screens with random links from one page to many others. However, the cloud concept was mostly limited to content; the applications were still structured, and the software and hardware infrastructure was still structured.

The revolution in inception (if it comes to pass) is in the way IT hardware / software is provisioned and used to run applications and to store and retrieve data. The infrastructure cloud replaces the IT department with a black box that takes a credit card and gives you incremental hardware and software resources to build your applications on. The infrastructure cloud also supports a data cloud, i.e. another black box that will replace your data architect and DBA. All database maintenance work is automatically taken care of in the cloud. The data model is made non-existent using hyper tables that support name-value pairs instead of rows and columns. All that is left is the business logic, which needs to be coded use case by use case in applications.
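As a hypothetical illustration of the "hyper table" idea (the names are invented), here is a name-value-pair store next to a conventional row-and-column table. Every attribute becomes a row, and the data model effectively disappears into the application code that knows which names to ask for.

-- Conventional relational table: the model is explicit in the columns.
CREATE TABLE Customer (
    customer_id INT PRIMARY KEY,
    name        VARCHAR(100),
    email       VARCHAR(100)
);

-- Name-value-pair ("hyper table") style: the model lives in the application.
CREATE TABLE EntityAttribute (
    entity_id  VARCHAR(50),
    attr_name  VARCHAR(100),
    attr_value VARCHAR(4000)
);

-- The same customer as three rows; nothing enforces which attributes exist.
INSERT INTO EntityAttribute VALUES ('cust-1', 'name',    'Jane Doe');
INSERT INTO EntityAttribute VALUES ('cust-1', 'email',   'jane@example.com');
INSERT INTO EntityAttribute VALUES ('cust-1', 'segment', 'retail');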

This is not a pipe dream. This is where we could end up in a couple of years' time. Or not. And this doesn't have to be a failure. Most businesses can go back to being strict about the use cases they support in their business processes and put strict controls on data integrity. I.e. if you call the business and you spell your name wrong - too bad, you are not in our system and will not be supported; there will be no search functionality to rescue you. Since there is no structured data model in the data cloud, there is the possibility of a different representation scheme, i.e. variables and their relationships, for different use cases even within a single application.

In summary – applications with a few use cases can be built and supported very easily and cheaply on the cloud.

Where does that leave data warehousing? We basically investigate, understand and merge data from different sources with different representation schemes (i.e. data models) into one specially designed representation scheme to support analytic efforts. We will likely continue to do that – this time integrating data from different use cases into a coherent analytic data model. That increases the scope tremendously, but it is somewhat offset by using the cloud itself to build data warehouses, assuming the cloud has a special color (dark clouds?) to support the different characteristics of data warehousing loads. The ETL tools we use today will have to change to support the new requirements; those changes will wait for the market to firm up.

Wednesday, March 12, 2008

Background of the Previous Post's Talk

I was working on a data warehouse for a multi-product ASP. Faced with a multitude of data issues, I became an advocate for data quality at the source, slowly becoming responsible for the source data architecture too. This talk was written in that phase.

Tuesday, March 11, 2008

Talk: The Story of Data

Monday, January 28, 2008

Applications Vs ASPs