I am often asked usually by programmers - What is Data Warehousing & how do I learn it? I explain to them we use all the same tools that you do but differently. That’s when I coined the term Data Sense. It describes the essence of Data Warehousing and separates Data Warehousing from rest of Programming. Every aspect of IT from Hardware / Software infrastructure to Design, Development and QA is done with massive data flows and need for data precession accuracy and meaning.

Wednesday, September 24, 2008

Wednesday, September 17, 2008

Salient Features of Data Modeling Techniques

Relational: Normalization, Data Application Independence. One type of relation between entities

Object Model: Extensability, reuse, encapsulation. Conditional, Inheritance, association and other kinds of relationships between objects.

Dimensional: Precision, Understandability, Division of variables into measures / facts (Measures of business) and dimensions (parameters of Business Performance). One to many relationships between dimensions and facts. No relationships between dimensions. Hierarchhies within dimensions.

ORM: Mapping Objects onto relational space. Object relationships are built and maintained in application layer

ODM: Mapping Objects onto dimensional space. Multiple dimensional objects per dimension and multiple fact objects are all related to each other, and relationships implemented in reporting layer.

Wednesday, September 3, 2008

My Attempt at creating a hundred dimension data warehouse (Sarcastic)

To actually build a hundred + dimension data warehouse, I got myself to think like an OO programmer. I usually don’t work like this. I intend this to prove to my audience that bad design choices lead to complicated data warehouses. Again my refrain is if you want to write programs use java, and programming constructs. Don’t mix data warehousing and programming paradigms. If you do so you will end up not with Dimensional Model but with ODM – Object Dimensional model.
Vision: Create a data warehouse to store FIX data for analysis.
Data Sources: FIX message Logs. Reference data stored in applications.
Business Analysis: Based on understanding of FIX protocol. There are messages components and fields. Fields are used in messages, components or independently. There are 100+ messages. Taking a cursory look and removing all messages that look like not important, and making a fact table per message I end up with 64 fact tables. (Wow! almost the same number in a messed up data warehouse in a major stock exchange). The list is at the end of this blog.
Finding dimensions is a little bit more complex. There are nearly 1800 fields and 163 components. The trick is to find the fields and components used by the 64 fact tables below.
F_ResendRequest
F_Reject
F_SequenceReset
F_Logout
F_IOI
F_Advertisement
F_ExecutionReport
F_OrderCancelReject
F_Logon
F_News
F_Email
F_NewOrderSingle
F_NewOrderList
F_OrderCancelRequest
F_OrderCancelReplaceRequest
F_OrderStatusRequest
F_AllocationInstruction
F_ListCancelRequest
F_ListExecute
F_ListStatusRequest
F_ListStatus
F_AllocationInstructionAck
F_DontKnowTradeDK
F_QuoteRequest
F_Quote
F_SettlementInstructions
F_MarketDataRequest
F_MarketDataSnapshotFullRefresh
F_MarketDataIncrementalRefresh
F_MarketDataRequestReject
F_QuoteCancel
F_QuoteStatusRequest
F_MassQuoteAcknowledgement
F_SecurityDefinitionRequest
F_SecurityDefinition
F_SecurityStatusRequest
F_SecurityStatus
F_TradingSessionStatusRequest
F_TradingSessionStatus
F_MassQuote
F_BusinessMessageReject
F_BidRequest
F_BidResponse
F_ListStrikePrice
F_XML_non_FIX
F_RegistrationInstructions
F_RegistrationInstructionsResponse
F_OrderMassCancelRequest
F_OrderMassCancelReport
F_NewOrderCross
F_CrossOrderCancelReplaceRequest
F_CrossOrderCancelRequest
F_SecurityTypeRequest
F_SecurityTypes
F_SecurityListRequest
F_SecurityList
F_DerivativeSecurityListRequest
F_DerivativeSecurityList
F_NewOrderMultileg
F_MultilegOrderCancelReplace
F_TradeCaptureReportRequest
F_TradeCaptureReport
F_OrderMassStatusRequest
F_QuoteRequestReject
F_RFQRequest
F_QuoteStatusReport
F_QuoteResponse
F_Confirmation
F_PositionMaintenanceRequest
F_PositionMaintenanceReport
F_RequestForPositions
F_RequestForPositionsAck
F_PositionReport
F_TradeCaptureReportRequestAck
F_TradeCaptureReportAck
F_AllocationReport
F_AllocationReportAck
F_Confirmation_Ack
F_SettlementInstructionRequest
F_AssignmentReport
F_CollateralRequest
F_CollateralAssignment
F_CollateralResponse
F_CollateralReport
F_CollateralInquiry
F_NetworkCounterpartySystemStatusRequest
F_NetworkCounterpartySystemStatusResponse
F_UserRequest
F_UserResponse
F_CollateralInquiryAck
F_ConfirmationRequest
F_ContraryIntentionReport
F_SecurityDefinitionUpdateReport
F_SecurityListUpdateReport
F_AdjustedPositionReport
F_AllocationInstructionAlert
F_ExecutionAcknowledgement
F_TradingSessionList
F_TradingSessionListRequest

I see a pattern in the fields. For some fields there is ID and Type. Browsing through the fields these look like the fields I’d like to make a report by. So I make these as dimensions. There are 163 of them. This looks erringly like the data warehouse I saw at a major stock exchange. Here is the dimension list.

D_AccountType
D_AdjustmentType
D_AdvTransType
D_AllocAccountType
D_AllocIntermedReqType
D_AllocLinkType
D_AllocNoOrdersType
D_AllocReportType
D_AllocSettlInstType
D_AllocTransType
D_AllocType
D_ApplReportType
D_ApplReqType
D_ApplResponseType
D_BasisPxType
D_BenchmarkPriceType
D_BidDescriptorType
D_BidRequestTransType
D_BidTradeType
D_BidType
D_BookingType
D_CPRegType
D_CollApplType
D_CollAsgnRespType
D_CollAsgnTransType
D_CommType
D_ConfirmTransType
D_ConfirmType
D_ContAmtType
D_ContingencyType
D_CrossType
D_CxlType
D_DeliveryType
D_DerivativeEventType
D_DerivativeInstrAttribType
D_DerivativeInstrumentPartySubIDType
D_DerivativeSecurityListRequestType
D_DerivativeSecuritySubType
D_DerivativeSecurityType
D_DeskType
D_DiscretionLimitType
D_DiscretionMoveType
D_DiscretionOffsetType
D_DlvyInstType
D_EmailType
D_EventType
D_ExecPriceType
D_ExecTransType
D_ExecType
D_ExpirationQtyType
D_IOITransType
D_IndividualAllocType
D_InstrAttribType
D_InstrumentPartySubIDType
D_LegBenchmarkPriceType
D_LegPriceType
D_LegRepoCollateralSecurityType
D_LegSecuritySubType
D_LegSecurityType
D_LegSettlType
D_LegStipulationType
D_LegSwapType
D_LiquidityIndType
D_ListExecInstType
D_ListStatusType
D_LotType
D_MDBookType
D_MDEntryType
D_MDFeedType
D_MDOriginType
D_MDQuoteType
D_MDSecSizeType
D_MDSubBookType
D_MDUpdateType
D_MassActionType
D_MassCancelRequestType
D_MassStatusReqType
D_MatchType
D_MiscFeeType
D_MsgType
D_MultiLegReportingType
D_Nested2PartySubIDType
D_Nested3PartySubIDType
D_Nested4PartySubIDType
D_NestedInstrAttribType
D_NestedPartySubIDType
D_NetworkRequestType
D_NetworkStatusResponseType
D_OrdType
D_OwnerType
D_OwnershipType
D_PartySubIDType
D_PegLimitType
D_PegMoveType
D_PegOffsetType
D_PegPriceType
D_PosAmtType
D_PosReqType
D_PosTransType
D_PosType
D_PriceLimitType
D_PriceType
D_QtyType
D_QuantityType
D_QuoteCancelType
D_QuotePriceType
D_QuoteRequestType
D_QuoteRespType
D_QuoteType
D_RefMsgType
D_RegistAcctType
D_RegistTransType
D_RepoCollateralSecurityType
D_RespondentType
D_ResponseTransportType
D_RootPartySubIDType
D_RoutingType
D_SecondaryPriceLimitType
D_SecondaryTrdType
D_SecurityListRequestType
D_SecurityRequestType
D_SecurityResponseType
D_SecuritySubType
D_SecurityType
D_SettlDeliveryType
D_SettlInstTransType
D_SettlObligTransType
D_SettlPartySubIDType
D_SettlPriceType
D_SettlType
D_SideMultiLegReportingType
D_SideTrdRegTimestampType
D_StandInstDbType
D_StatsType
D_StipulationType
D_StrategyParameterType
D_SubscriptionRequestType
D_TaxAdvantageType
D_TerminationType
D_TickRuleType
D_TradeReportTransType
D_TradeReportType
D_TradeRequestType
D_TrdRegTimestampType
D_TrdSubType
D_TrdType
D_TriggerOrderType
D_TriggerPriceType
D_TriggerType
D_UnderlyingCPRegType
D_UnderlyingCashType
D_UnderlyingLegSecuritySubType
D_UnderlyingLegSecurityType
D_UnderlyingRepoCollateralSecurityType
D_UnderlyingSecuritySubType
D_UnderlyingSecurityType
D_UnderlyingSettlPriceType
D_UnderlyingSettlementType
D_UnderlyingStipType
D_UndlyInstrumentPartySubIDType
D_UserRequestType
D_YieldRedemptionPriceType
D_YieldType
There you have it – a 150-dimension 60-fact table monster badly designed. No doubt you will have to write huge programs to analyze this data.
Doing it the right way is much harder and will take much longer and will need an understanding of the business. I need to get paid to do that, but I leave you with final thought. I have reduced a similar 150 + dimension 50 + monster at a major stock exchange to a hypothetical model with <15 dimensions and 3 fact tables. That should be an indicator of where you need to be.