DataSense

I am often asked, usually by programmers: what is Data Warehousing, and how do I learn it? I explain that we use all the same tools they do, but differently. That's when I coined the term Data Sense. It describes the essence of Data Warehousing and separates Data Warehousing from the rest of programming. Every aspect of IT, from hardware and software infrastructure to design, development, and QA, is done with massive data flows and a need for data precision, accuracy, and meaning.

Friday, April 23, 2010

Difference between Cube, Database and file system data processing

File system: You need to specify how to access the data, how to retrieve it, and how to analyze it to get measurements.

Database: You need to specify how to retrieve the data and analyze it to get measurements.

Cube: You read the measurements.
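The three levels above can be sketched in a few lines of Python. This is a toy illustration with invented data, not anything from a real system: the same total is computed three ways, with less and less work specified by the analyst at each level.

```python
import csv, io, sqlite3

raw = "store,day,profit\nA,1,100\nA,2,150\nB,1,80\n"

# File system: you write the access, retrieval, and analysis yourself.
total_file = sum(
    int(row["profit"])
    for row in csv.DictReader(io.StringIO(raw))
    if row["store"] == "A"
)

# Database: retrieval is declarative, but you still specify the analysis.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, day INT, profit INT)")
conn.executemany("INSERT INTO sales VALUES (?,?,?)",
                 [("A", 1, 100), ("A", 2, 150), ("B", 1, 80)])
total_db = conn.execute(
    "SELECT SUM(profit) FROM sales WHERE store = 'A'").fetchone()[0]

# Cube: the measurement is pre-aggregated; you just read it off.
cube = {("A",): 250, ("B",): 80}  # dimensional coordinates -> measure
total_cube = cube[("A",)]
```

All three return 250; the difference is in how much of the "how" you had to spell out.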

Vijay

Wednesday, April 7, 2010

Thoughts on Financial Data Analysis

Measurement / Experiment: We have a financial system that we try to understand by looking at the way it has acted in the past. From a purely historical perspective, why something happened is important. In addition, knowledge of why something happened in the past helps us anticipate future behavior, or lets us modify our interface with the system (i.e. your assets) to our advantage. We keep in mind that such modification, in and of itself, may change the system.

When the world is tiny (say, a shop in an isolated town) you can understand how to manage your business by intuition or experience, the way your child does arithmetic. However, as the system grows and becomes more complex, you will find the need to abstract, i.e. to stand taller to see farther, the way your child now studies algebra. You represent the system in terms of variables and the values of those variables, and call it data. The job at this point is to
1) Find the right variables to represent the system
2) Discover the relationships among those variables.
As data size increases you will find yourself confused again. Statistics lets us think clearly about data.

Statistical Data Analysis: You use statistical methods (predominantly non-Bayesian) to find the right variables and the relationships between those variables. You work with event distributions, ways of representing the shapes of those distributions, and so on. The end result is to reduce the data from gigabytes to a few numbers that you can comprehend.
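A minimal sketch of that reduction, using synthetic data standing in for the gigabytes: a hundred thousand observations collapse into a handful of numbers describing the distribution.

```python
import random
import statistics

random.seed(42)
# Invented stand-in for raw measurements, e.g. daily returns.
returns = [random.gauss(0.0005, 0.02) for _ in range(100_000)]

# The whole sample reduced to five comprehensible numbers.
summary = {
    "n": len(returns),
    "mean": statistics.fmean(returns),
    "stdev": statistics.stdev(returns),
    "min": min(returns),
    "max": max(returns),
}
```

Which few numbers are the right ones is exactly the "find the right variables" problem; mean and standard deviation are only the most common choices.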

Mathematical Model: At this point the problem may be that you don't have data covering all the possible ranges of all the relevant variables, so you are restricted to a subset of those ranges, a subspace. The issue then is how to understand behavior outside that subspace. You do so by building a heuristic mathematical model that interpolates and extrapolates the known information. Here you run the risk of discontinuous behavior going uncaptured, e.g. a phase change such as water turning to ice. So every mathematical model comes with fine print. Further measurements will refine the model.
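The water-to-ice fine print can be made concrete with invented numbers: a least-squares line fit only to measurements above freezing extrapolates smoothly below zero and completely misses the expansion at the phase change.

```python
def linear_fit(xs, ys):
    """Least-squares slope and intercept, computed by hand."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

# Hypothetical measurements, all above 0 C: water's volume changes
# only slightly with temperature in the observed subspace.
temps = [20.0, 15.0, 10.0, 5.0, 1.0]
vols  = [1.0018, 1.0009, 1.0003, 1.0000, 1.0001]
slope, intercept = linear_fit(temps, vols)

# Extrapolating to -5 C predicts a volume near 1.0 ...
predicted = slope * -5.0 + intercept
# ... but the real system freezes and expands roughly 9%: a
# discontinuity the model built from the subspace cannot capture.
actual_ice = 1.09
```

The model is fine inside the observed range; the fine print is that nothing in the data warned it about the discontinuity outside.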

Theory: Understanding the experimental measurements at a deeper level, i.e. in terms of more abstract variables and relationships, sometimes lets us sidestep the problems of mathematical models. Again, we can look farther with the theory and make a measurement to verify whether the theory is right.

The above process is very important, and historically it has worked. There are specific times and places where you sidestep parts of it.

Data Mining: You can develop algorithms to sweep through the data and discover candidates for further statistical analysis. That is, some of the routine grunt work can be automated into a repeatable algorithm. You can take it further and use these tools themselves as part of the statistical data analysis process, e.g. use neural nets to find segments that you then analyze further.
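One sketch of that automated grunt work, with invented variable names and data: sweep every pair of variables and flag strong correlations as candidates for further statistical analysis.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient, computed by hand."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) *
                      sum((y - my) ** 2 for y in ys))

# Hypothetical variables observed over six periods.
data = {
    "volume":  [100, 120, 90, 150, 130, 110],
    "trades":  [10, 12, 9, 15, 13, 11],
    "latency": [5, 3, 6, 2, 4, 5],
    "weekday": [1, 2, 3, 4, 5, 1],
}

# Repeatable sweep: flag variable pairs with |r| > 0.8.
candidates = [(a, b, pearson(data[a], data[b]))
              for a, b in combinations(data, 2)
              if abs(pearson(data[a], data[b])) > 0.8]
```

Each flagged pair is a candidate, not a conclusion; an analyst (or a further model such as a neural net) takes it from here.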

Monday, January 19, 2009

One cube or many

You define multiple cubes only in these cases:
1) Security becomes easier to handle when there are different cubes for different users.
2) You can split the business dimensional space into two subspaces of lower dimensionality, with few dimensions overlapping between them.
3) You have different sets of measures that are local in different cubes although the business dimensional space is similar. In this case you can even keep the same cube but use different measure groups.

Vijay

Local Measures in Business Dimensional Space

A few posts down I described the business dimensional space that forms the basis for Microsoft cubes; that scheme is repeated below. Steps 0-4 define the space, and steps 5-6 define how to place the measures in the business dimensional space. In this post I extend the scheme to define and use local measures.

What is a local measure: A local measure is completely defined as a function of the dimensional coordinates of the cell it is contained in, AND this property extends across all hierarchies that the cell is part of. For example, a Profit measure in a two-dimensional cube of Date and Store is local because it is a function of a particular store on a certain day (profit of a store in a day), AND a function of the particular division the store is part of and the month the day is part of (profit of a division in a month).
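The locality property can be sketched with a hypothetical Date-by-Store cube: the leaf-level profit cells fully determine every cell at higher hierarchy levels, so division-by-month profit is just a rollup of the same measure.

```python
# Leaf cells of a hypothetical cube: (store, day) -> profit.
leaf = {
    ("S1", "2010-04-01"): 100,
    ("S1", "2010-04-02"): 120,
    ("S2", "2010-04-01"): 80,
}

# Hierarchies: store rolls up to division, day rolls up to month.
division = {"S1": "East", "S2": "East"}

def month(day):
    return day[:7]  # "2010-04-01" -> "2010-04"

def division_month_profit(div, mon):
    """Profit of a division in a month, determined entirely by the
    leaf cells it contains -- which is what makes the measure local."""
    return sum(v for (s, d), v in leaf.items()
               if division[s] == div and month(d) == mon)
```

A non-local measure is one where this function would also need cells outside the rollup, which is exactly what complicates retrieval.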

Use of non-local measures considerably complicates the retrieval of data from the cube and defeats the purpose of the cube. A similar result can be achieved in that case by writing programs against raw or aggregated data.

Another example: A hedge fund gets its data as delta positions. Storing the delta positions in the cube makes a simple measure, current value, non-local, forcing the development of an MDX program that browses the entire cube. Current value becomes a trivially local measure if the delta positions are converted into positions before loading into the cube. For a measure such as transaction costs, however, delta positions may themselves be local.
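The conversion the example calls for is just a running sum over invented delta values: stored as positions, a cell's current value no longer depends on every earlier cell.

```python
from itertools import accumulate

# Hypothetical per-period position changes as delivered by the source.
deltas = [100, -30, 50, -20]

# Convert to running positions before loading into the cube,
# so "current value" can be read from a single cell.
positions = list(accumulate(deltas))  # [100, 70, 120, 100]
```

Left as deltas, current value at period t is the sum of all cells up to t, which is the non-local dependence that forces a cube-wide MDX traversal.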

This basically defines when you have one cube and when you have many cubes for a particular business problem. That's the subject of the next post.

Scheme for defining Cubes:

0) Define the space you are going to work in, i.e. steps 1-3 below.
1) Identify dimensions as entities holding business variables. Ask: does the problem I am trying to measure / model contain this variable as an independent parameter, or is it an artifact of data representation? Lookups and data-processing artifacts of whatever kind are not dimensions. The next question is whether all these variables are independent, or whether they can be grouped into abstract entities, thereby decreasing dimensionality. Keep in mind the world is not 100-dimensional.
2) Ensure your dimensions are orthogonal, i.e. there are no relationships between dimensions. The only relationships allowed are hierarchies within dimensions and relationships between attributes (e.g. code, description).
3) Further abstract your dimensions at the entity level to reduce dimensionality. It is OK if some attributes are empty; you are not trying to reduce space here.
4) Now you have defined an n-dimensional space with multiple scales for the problem. Place the data points in the space.
5) Define measures that represent the observed data at the dimensional hierarchy level of the data points.
6) Define the relationship between measures at higher levels of the hierarchy and at lower / observed levels. These could be simple sums of the values in each of the dimensional cells, or much more complex formulae.
Now you have defined and placed the data properly in MS BI cubes. Remember you still need to verify results against the data points, since error calculation and propagation is not part of the above scheme. You may incorporate it, but that may be non-trivial.
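Step 6 above can be sketched with invented child cells: quantity rolls up by simple addition, while a hypothetical average-price measure needs a weighted formula rather than a plain sum.

```python
# Hypothetical child cells one level below a parent in the hierarchy.
children = [
    {"qty": 10, "avg_price": 5.0},
    {"qty": 30, "avg_price": 7.0},
]

# Additive rollup: the parent's quantity is just the sum.
parent_qty = sum(c["qty"] for c in children)

# Formula rollup: average price must be quantity-weighted; a plain
# sum (or plain average) of the children would be wrong.
parent_avg_price = (sum(c["qty"] * c["avg_price"] for c in children)
                    / parent_qty)
```

Here the parent gets qty 40 and avg_price 6.5, not the naive average of 6.0: the kind of relationship step 6 asks you to define explicitly.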

Wednesday, September 24, 2008

Wednesday, September 17, 2008

Salient Features of Data Modeling Techniques

Relational: Normalization, Data Application Independence. One type of relation between entities

Object Model: Extensibility, reuse, encapsulation. Inheritance, association, and other kinds of relationships between objects.

Dimensional: Precision, understandability, division of variables into measures / facts (measures of business) and dimensions (parameters of business performance). One-to-many relationships between dimensions and facts. No relationships between dimensions. Hierarchies within dimensions.

ORM: Mapping Objects onto relational space. Object relationships are built and maintained in application layer

ODM: Mapping Objects onto dimensional space. Multiple dimensional objects per dimension and multiple fact objects are all related to each other, and relationships implemented in reporting layer.

Wednesday, September 3, 2008

My Attempt at creating a hundred dimension data warehouse (Sarcastic)

To actually build a hundred-plus-dimension data warehouse, I got myself to think like an OO programmer. I usually don't work like this; I intend this to prove to my audience that bad design choices lead to complicated data warehouses. Again, my refrain is: if you want to write programs, use Java and programming constructs. Don't mix data warehousing and programming paradigms. If you do, you will end up not with a Dimensional Model but with ODM, an Object Dimensional Model.
Vision: Create a data warehouse to store FIX data for analysis.
Data Sources: FIX message Logs. Reference data stored in applications.
Business Analysis: Based on an understanding of the FIX protocol. There are messages, components, and fields. Fields are used in messages, in components, or independently. There are 100+ messages. Taking a cursory look, removing all messages that do not look important, and making a fact table per message, I end up with 64 fact tables. (Wow! Almost the same number as in a messed-up data warehouse at a major stock exchange.) The list is at the end of this blog.
Finding dimensions is a little more complex. There are nearly 1800 fields and 163 components. The trick is to find the fields and components used by the 64 fact tables below.
F_ResendRequest
F_Reject
F_SequenceReset
F_Logout
F_IOI
F_Advertisement
F_ExecutionReport
F_OrderCancelReject
F_Logon
F_News
F_Email
F_NewOrderSingle
F_NewOrderList
F_OrderCancelRequest
F_OrderCancelReplaceRequest
F_OrderStatusRequest
F_AllocationInstruction
F_ListCancelRequest
F_ListExecute
F_ListStatusRequest
F_ListStatus
F_AllocationInstructionAck
F_DontKnowTradeDK
F_QuoteRequest
F_Quote
F_SettlementInstructions
F_MarketDataRequest
F_MarketDataSnapshotFullRefresh
F_MarketDataIncrementalRefresh
F_MarketDataRequestReject
F_QuoteCancel
F_QuoteStatusRequest
F_MassQuoteAcknowledgement
F_SecurityDefinitionRequest
F_SecurityDefinition
F_SecurityStatusRequest
F_SecurityStatus
F_TradingSessionStatusRequest
F_TradingSessionStatus
F_MassQuote
F_BusinessMessageReject
F_BidRequest
F_BidResponse
F_ListStrikePrice
F_XML_non_FIX
F_RegistrationInstructions
F_RegistrationInstructionsResponse
F_OrderMassCancelRequest
F_OrderMassCancelReport
F_NewOrderCross
F_CrossOrderCancelReplaceRequest
F_CrossOrderCancelRequest
F_SecurityTypeRequest
F_SecurityTypes
F_SecurityListRequest
F_SecurityList
F_DerivativeSecurityListRequest
F_DerivativeSecurityList
F_NewOrderMultileg
F_MultilegOrderCancelReplace
F_TradeCaptureReportRequest
F_TradeCaptureReport
F_OrderMassStatusRequest
F_QuoteRequestReject
F_RFQRequest
F_QuoteStatusReport
F_QuoteResponse
F_Confirmation
F_PositionMaintenanceRequest
F_PositionMaintenanceReport
F_RequestForPositions
F_RequestForPositionsAck
F_PositionReport
F_TradeCaptureReportRequestAck
F_TradeCaptureReportAck
F_AllocationReport
F_AllocationReportAck
F_Confirmation_Ack
F_SettlementInstructionRequest
F_AssignmentReport
F_CollateralRequest
F_CollateralAssignment
F_CollateralResponse
F_CollateralReport
F_CollateralInquiry
F_NetworkCounterpartySystemStatusRequest
F_NetworkCounterpartySystemStatusResponse
F_UserRequest
F_UserResponse
F_CollateralInquiryAck
F_ConfirmationRequest
F_ContraryIntentionReport
F_SecurityDefinitionUpdateReport
F_SecurityListUpdateReport
F_AdjustedPositionReport
F_AllocationInstructionAlert
F_ExecutionAcknowledgement
F_TradingSessionList
F_TradingSessionListRequest

I see a pattern in the fields: for some fields there is an ID and a Type. Browsing through the fields, these look like the fields I'd like to report by, so I make them dimensions. There are 163 of them. This looks eerily like the data warehouse I saw at a major stock exchange. Here is the dimension list.

D_AccountType
D_AdjustmentType
D_AdvTransType
D_AllocAccountType
D_AllocIntermedReqType
D_AllocLinkType
D_AllocNoOrdersType
D_AllocReportType
D_AllocSettlInstType
D_AllocTransType
D_AllocType
D_ApplReportType
D_ApplReqType
D_ApplResponseType
D_BasisPxType
D_BenchmarkPriceType
D_BidDescriptorType
D_BidRequestTransType
D_BidTradeType
D_BidType
D_BookingType
D_CPRegType
D_CollApplType
D_CollAsgnRespType
D_CollAsgnTransType
D_CommType
D_ConfirmTransType
D_ConfirmType
D_ContAmtType
D_ContingencyType
D_CrossType
D_CxlType
D_DeliveryType
D_DerivativeEventType
D_DerivativeInstrAttribType
D_DerivativeInstrumentPartySubIDType
D_DerivativeSecurityListRequestType
D_DerivativeSecuritySubType
D_DerivativeSecurityType
D_DeskType
D_DiscretionLimitType
D_DiscretionMoveType
D_DiscretionOffsetType
D_DlvyInstType
D_EmailType
D_EventType
D_ExecPriceType
D_ExecTransType
D_ExecType
D_ExpirationQtyType
D_IOITransType
D_IndividualAllocType
D_InstrAttribType
D_InstrumentPartySubIDType
D_LegBenchmarkPriceType
D_LegPriceType
D_LegRepoCollateralSecurityType
D_LegSecuritySubType
D_LegSecurityType
D_LegSettlType
D_LegStipulationType
D_LegSwapType
D_LiquidityIndType
D_ListExecInstType
D_ListStatusType
D_LotType
D_MDBookType
D_MDEntryType
D_MDFeedType
D_MDOriginType
D_MDQuoteType
D_MDSecSizeType
D_MDSubBookType
D_MDUpdateType
D_MassActionType
D_MassCancelRequestType
D_MassStatusReqType
D_MatchType
D_MiscFeeType
D_MsgType
D_MultiLegReportingType
D_Nested2PartySubIDType
D_Nested3PartySubIDType
D_Nested4PartySubIDType
D_NestedInstrAttribType
D_NestedPartySubIDType
D_NetworkRequestType
D_NetworkStatusResponseType
D_OrdType
D_OwnerType
D_OwnershipType
D_PartySubIDType
D_PegLimitType
D_PegMoveType
D_PegOffsetType
D_PegPriceType
D_PosAmtType
D_PosReqType
D_PosTransType
D_PosType
D_PriceLimitType
D_PriceType
D_QtyType
D_QuantityType
D_QuoteCancelType
D_QuotePriceType
D_QuoteRequestType
D_QuoteRespType
D_QuoteType
D_RefMsgType
D_RegistAcctType
D_RegistTransType
D_RepoCollateralSecurityType
D_RespondentType
D_ResponseTransportType
D_RootPartySubIDType
D_RoutingType
D_SecondaryPriceLimitType
D_SecondaryTrdType
D_SecurityListRequestType
D_SecurityRequestType
D_SecurityResponseType
D_SecuritySubType
D_SecurityType
D_SettlDeliveryType
D_SettlInstTransType
D_SettlObligTransType
D_SettlPartySubIDType
D_SettlPriceType
D_SettlType
D_SideMultiLegReportingType
D_SideTrdRegTimestampType
D_StandInstDbType
D_StatsType
D_StipulationType
D_StrategyParameterType
D_SubscriptionRequestType
D_TaxAdvantageType
D_TerminationType
D_TickRuleType
D_TradeReportTransType
D_TradeReportType
D_TradeRequestType
D_TrdRegTimestampType
D_TrdSubType
D_TrdType
D_TriggerOrderType
D_TriggerPriceType
D_TriggerType
D_UnderlyingCPRegType
D_UnderlyingCashType
D_UnderlyingLegSecuritySubType
D_UnderlyingLegSecurityType
D_UnderlyingRepoCollateralSecurityType
D_UnderlyingSecuritySubType
D_UnderlyingSecurityType
D_UnderlyingSettlPriceType
D_UnderlyingSettlementType
D_UnderlyingStipType
D_UndlyInstrumentPartySubIDType
D_UserRequestType
D_YieldRedemptionPriceType
D_YieldType
There you have it: a badly designed monster of some 150 dimensions and 60-odd fact tables. No doubt you will have to write huge programs to analyze this data.
Doing it the right way is much harder, will take much longer, and will need an understanding of the business. I need to get paid to do that, but I leave you with a final thought: I have reduced a similar 150+ dimension, 50+ fact table monster at a major stock exchange to a hypothetical model with fewer than 15 dimensions and 3 fact tables. That should be an indicator of where you need to be.