I am often asked, usually by programmers: what is Data Warehousing, and how do I learn it? I explain to them that we use all the same tools they do, but differently. That’s when I coined the term Data Sense. It describes the essence of Data Warehousing and separates Data Warehousing from the rest of programming. Every aspect of IT, from hardware and software infrastructure to design, development and QA, is done with massive data flows and the need for data precision, accuracy and meaning.

Friday, April 23, 2010

Difference between Cube, Database and file system data processing

File system: You need to specify how to access data, retrieve data and analyze data to get measurements.

Database: You need to specify how to retrieve data and analyze data to get measurements.

Cube: You read the measurements.
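
To make the contrast concrete, here is a minimal sketch in Python. The CSV layout, the sales table and the cube keys are invented for illustration; they are not from any particular system.

```python
# Hypothetical example: total sales for a region at each of the three levels.
import csv
import sqlite3

# File system: you specify how to access, retrieve and analyze the raw records.
def sales_from_file(path, region):
    total = 0.0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):          # access
            if row["region"] == region:        # retrieve
                total += float(row["amount"])  # analyze
    return total

# Database: access is handled for you; you still specify retrieval and analysis.
def sales_from_db(conn, region):
    cur = conn.execute(
        "SELECT SUM(amount) FROM sales WHERE region = ?", (region,)
    )
    return cur.fetchone()[0]

# Cube: the measurement is pre-aggregated; you simply read it.
cube = {("East", "2010-Q1"): 125000.0}         # (region, period) -> measure
def sales_from_cube(region, period):
    return cube[(region, period)]
```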

Vijay

Wednesday, April 7, 2010

Thoughts on Financial Data Analysis

Measurement / Experiment: We have a financial system that we try to understand by looking at the way it has acted in the past. From a purely historical perspective, why something happened is important. In addition, knowledge of why something happened in the past will help us anticipate future behavior, or let us modify our interface with the system (i.e. your assets) to our advantage. We keep in mind that such a modification may, in and of itself, change the system.

When the world is tiny (i.e. a shop in an isolated town) you can understand how to manage your business by intuition or by experience, like how your child does arithmetic. However, as the system grows and becomes more complex, you will find the need to abstract, i.e. to stand taller to see farther, like your child now studying algebra. You represent the system in terms of variables and values of those variables, and call it data. The job at this point is to
1) Find the right variables to represent the system
2) Discover the relationships among those variables.
As the data size increases you will find yourself confused again. Statistics lets us think clearly about the data.
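
A toy sketch of those two steps, where the variables (price, ad_spend, units_sold) and the numbers are made up purely for illustration:

```python
import statistics

# 1) Choose variables to represent the system: one record per week.
weeks = [
    {"price": 9.99,  "ad_spend": 200, "units_sold": 410},
    {"price": 9.99,  "ad_spend": 350, "units_sold": 520},
    {"price": 11.49, "ad_spend": 350, "units_sold": 430},
    {"price": 11.49, "ad_spend": 500, "units_sold": 560},
]

# 2) Look for relationships among those variables,
#    e.g. how ad spend moves with units sold.
ad   = [w["ad_spend"] for w in weeks]
sold = [w["units_sold"] for w in weeks]
print(statistics.correlation(ad, sold))  # Pearson correlation (Python 3.10+)
```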

Statistical Data Analysis: You use statistical methods (predominantly non-Bayesian) to find the right variables and the relationships between those variables. You use event distributions, ways of representing the shapes of those distributions, and so on. The end result, however, is to reduce the amount of data from gigabytes to a few numbers that you can comprehend.
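
A minimal sketch of that reduction, assuming a hypothetical CSV of daily returns with a single "return" column:

```python
import pandas as pd

returns = pd.read_csv("daily_returns.csv")["return"]

summary = {
    "count": returns.count(),
    "mean":  returns.mean(),
    "std":   returns.std(),
    "skew":  returns.skew(),       # shape of the distribution
    "kurt":  returns.kurt(),
    "p05":   returns.quantile(0.05),
    "p95":   returns.quantile(0.95),
}
print(summary)   # gigabytes of events collapsed to a handful of numbers
```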

Mathematical Model: At this point the problem can be that you don’t have data covering all the possible ranges of all the relevant variables. So you are restricted to a subset of the ranges of those variables, or a subspace. The issue then is how you will understand behavior outside that subspace. You do so by building a heuristic mathematical model, interpolating and extrapolating known information. Here there is the possibility that non-continuous behavior is not captured, e.g. a phase change such as water turning to ice. So every mathematical model comes with fine print. Further measurements will refine the model.
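
A minimal sketch of such a heuristic model, with made-up observations that only cover part of the range of x:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # the observed subspace of x
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # measurements in that subspace

coeffs = np.polyfit(x, y, deg=1)           # fit a simple linear model
model = np.poly1d(coeffs)

print(model(3.5))    # interpolation: inside the observed range
print(model(20.0))   # extrapolation: the fine print applies -- a "phase
                     # change" outside the observed range is not captured
```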

Theory: Understanding the experimental measurements at a deeper level, i.e. in terms of more abstract variables and relationships, will sometimes let us sidestep the problems of mathematical models. Again, we can look farther with the theory and make a measurement to verify whether the theory is right.

The above process is very important, and historically it has worked. There are specific times and places where you sidestep the process.

Data Mining: You can develop algorithms to sweep through the data and discover candidates for further statistical analysis; i.e. some of the routine grunt work can be automated into a repeatable algorithm. You can take it further and use these tools themselves as part of the statistical data analysis process, e.g. you can use neural nets to find segments that you then analyze further.
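
As one illustration of the segmentation idea, here is a minimal sketch that uses k-means clustering rather than a neural net; the customer features and the number of segments are assumptions made for the example:

```python
import numpy as np
from sklearn.cluster import KMeans

# Each row is a customer: [average balance, transactions per month]
X = np.array([
    [1200,  4], [1500,  5], [900,   3],      # low-activity customers
    [9800, 40], [11000, 45], [10500, 38],    # high-activity customers
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)   # candidate segments handed off for further
                        # statistical analysis
```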