I am often asked, usually by programmers: What is Data Warehousing and how do I learn it? I explain that we use all the same tools that you do, but differently. That’s when I coined the term Data Sense. It describes the essence of Data Warehousing and separates Data Warehousing from the rest of programming. Every aspect of IT, from hardware / software infrastructure to design, development and QA, is done with massive data flows and the need for data precision, accuracy and meaning.

Wednesday, February 21, 2007

BI in its own network domain?

A very significant part of the effort in BI initiatives goes into putting proper infrastructure in place to handle massive flows of data. Further, the security and other requirements are different. It is not uncommon to see BI hardware being shoehorned into the current network and hardware setup, with CORP, Dev, QA and Prod environments, thus forcing BI to deal with security and other policies that are alien to it. e.g.: Any tool installation requires corporate approval, handling and management, yet that group doesn't have the skill set to install, maintain or manage the tool.

A way out is to create a BI domain and place a firewall around it, but let the policies inside the domain be more lax.

Monday, February 19, 2007

How many dimensions should I have in my Data Warehouse?

I can tell you how many you should not have: 25, 50, 100 or more, because I refuse to believe there are 100 significant entities by which you are analyzing your business. To understand this, let's start off with the definition of a dimension.

A dimension is an ENTITY by which you analyze your business. Even six months ago I believed a dimension is a variable by which you analyze your business. Then I happened to work on projects with more than 100 dimension tables, and I realized the scope for misunderstanding in that definition.

What is an entity and what is a variable or an attribute? A customer is an entity and a customer name is an attribute. Customer state, customer zip, customer type are some more attributes of the same entity. The dimension is Customer. You'll have 100 dimensions in your data warehouse only if you create a dimension for each attribute - e.g. Dim_Customer_Type or Dim_Tick_Type. Take your 100+ dimensions and make entities out of them. You'll end up with 10-15 entities or fewer. The name of the game is to get to the right level of abstraction.
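As a rough sketch of the difference (all table and column names here are hypothetical), the attributes belong inside one entity-level dimension rather than each becoming its own table:

    -- Anti-pattern: one dimension per attribute
    --   Dim_Customer_Type, Dim_Customer_State, Dim_Customer_Zip, ...

    -- One Customer entity carrying its attributes
    CREATE TABLE Dim_Customer (
        customer_key   INT PRIMARY KEY,   -- surrogate key
        customer_id    VARCHAR(20),       -- natural / source system key
        customer_name  VARCHAR(100),
        customer_state CHAR(2),
        customer_zip   VARCHAR(10),
        customer_type  VARCHAR(50)
    );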

When you have 100+ dimensions it becomes impossible to develop, build and manage the system. There is no ROI. Your resources don't increase just because you need them; your resources are what you can afford. Instead of beating your resources into the ground, whether people or machines, build a properly designed, simple data warehouse that satisfies the requirements.

Sunday, February 18, 2007

What is a BIG data warehouse?

When is a data warehouse BIG? Is it the number of tables? A couple of hundred dimension tables and a few dozen fact tables? That defeats the very purpose of data warehousing. A data warehouse must have as many dimensions as you want to analyze the business by, and no one thinks in a 150-dimension space. It must have as many fact tables as there are completely independent transactions - transactions of such different grain and location in dimension space that they cannot be put together.

Is it then a large fact table? Do you really want to store and analyze, on the fly, data at the lowest of low grain? The transactions should be stored at a grain you want to analyze the business by.

A data warehouse is BIG when it is BIG at its right size. Otherwise it's just a BIG mess.

Logic Dependency in Business

Enterprise Data Architecture Standards

Examples of Data Architecture standards to aid in standards identification. These are not proposals but rather a list of standards in use in other organizations.

Data Architecture
Principle: 1 Design the enterprise Data Architecture so it increases and facilitates the sharing of data across the enterprise.
· Sharing of data greatly reduces data entry and maintenance efforts.
· Data sharing requires an established infrastructure for widespread data access. This includes integration with the Application, Componentware, Integration, Messaging, Network, and Platform Architectures.
· Consistent shared data definitions ensure data accuracy, integrity, and consistency.
· Data sharing reduces the overall resources required to maintain data across the enterprise.

Data Architecture
Principle: 2 Create and maintain roles and responsibilities within the distributed enterprise Data Architecture to facilitate the management of data. This requires a working relationship between the business user organizations and Information Services (IS).
Business responsibilities are to:
· Provide accurate business definitions of data.
· Develop enterprise-wide business views of shared data.
· Provide business drivers to support centralized data administration.
· Make metadata available.
· Define security requirements for data.
IS responsibilities are to provide a robust technical infrastructure that includes:
· Open, accessible, and adaptable database management systems (DBMSs).
· Centralized data administration.
· Data replication facilities.
· Backup and recovery.
· Security.
· Database monitoring tools.
· Data quality monitoring tools.
· Application mechanisms for helping to ensure accurate data input.

Metadata
Principle: 3 When designing or modifying a database, review the Metadata Repository for existing standard and proposed data elements before implementing a new database to ensure data elements are defined according to Metadata Repository standards.
Design reviews are essential to ensure that shared firmwide data is defined consistently across all applications. Design reviews also determine whether data that already exists is consistently defined and not redundantly stored. Design reviews should document the following:
· Where is this application getting its data?
· What other applications are getting data from this application?
· Is data used by this application defined consistently with firmwide definitions? If not, is there a plan to define the data according to enterprise definitions?
A design review also evaluates the data requirements of a project and identifies the following:
· Data requirements that can be solved by using an existing metadata element.
· Data not already identified as metadata, which must be proposed as an inter-agency or firmwide standard to the Metadata Element Review Team to become metadata.
· Access available for application development projects to reference the metadata repository in order to actively research data requirements. Review the existing standard and proposed data elements in the metadata repository before implementing a new database to ensure data elements are defined according to standards.
· Key information about data stored in systems that are already implemented in the firm. If possible, evaluate existing systems to propose firmwide data elements.

Data Modeling
Principle: 4 Take the Entity-Relation (ER) model to the third normal form, then denormalize where necessary for performance.
· The third normal form is the most commonly recommended form for the ER model.
· In some cases, a denormalized database can perform faster as there can be fewer joins, or reduced access to multiple tables. This process saves both physical and logical input and output requirements.

Data Modeling
Principle: 5 Restrict free form data entry where possible.
· In the design phase, consider the values that may be input into a field. These values or domains should be normalized so that data is consistent across records or instances - for example, using consistent values for gender or address information.
· Use look-up tables and automate data entry for column or attribute domain values to restrict what is entered in a column.
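A minimal sketch of the look-up table approach in standard SQL (table and column names are hypothetical): the foreign key restricts the column to the values listed in the look-up table.

    CREATE TABLE Lkp_Gender (
        gender_code CHAR(1) PRIMARY KEY,      -- e.g. 'M', 'F', 'U'
        gender_desc VARCHAR(20) NOT NULL
    );

    CREATE TABLE Customer (
        customer_id INT PRIMARY KEY,
        gender_code CHAR(1) REFERENCES Lkp_Gender (gender_code)  -- only listed values allowed
    );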

Data Access Implementation
Principle: 6 Validate data at every practical level to ensure data quality and avoid unnecessary network traffic.
· Validation can be coded into multiple tiers of the n-tier architecture to ensure that only valid data is processed and sent across the network. For example, an invalid field entered in a data entry form can be corrected before data is written to the database.
· Data integrity verification rules should be used when possible.
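As a hedged illustration of the database tier of such validation (the table and its rules are hypothetical; the same rules would also be enforced in the entry form so invalid rows never cross the network):

    CREATE TABLE Order_Entry (
        order_id     INT PRIMARY KEY,
        quantity     INT     NOT NULL CHECK (quantity > 0),
        order_status CHAR(1) NOT NULL CHECK (order_status IN ('N', 'S', 'C'))  -- New, Shipped, Cancelled
    );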

Data Access Implementation
Principle: 7 Design the data access infrastructure to support the transparency of the location and access of data by each application.
· This means designing an n-tier architecture where all data access is managed through a middle tier. This design makes it easy to relocate, restructure, or re-platform the back-end database services with minimal disruption to the applications that use them. It is essential for adaptive systems.
· A client should not send SQL requests directly to a server. Instead of using SQL code, the client should communicate with the database through data access rules. The application receives a request from a client and sends a message to the data access rule; the data access rule sends an SQL call to the database. With this method, the client does not send SQL to the server, it sends a request for work.
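One common way to realize such a data access rule is a stored procedure that the middle tier calls on the client's behalf. A minimal sketch, assuming SQL Server style syntax and hypothetical object names:

    CREATE PROCEDURE usp_GetCustomerOrders
        @customer_id INT
    AS
    BEGIN
        -- The client asks for "orders for customer X"; only this procedure
        -- knows the tables and SQL behind the request.
        SELECT o.order_id, o.order_date, o.order_total
        FROM   dbo.Orders o
        WHERE  o.customer_id = @customer_id;
    END;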

Data Access Implementation
Principle: 8 For data quality management, implement tools, methods, processes and policies to provide high-level data accuracy and consistency across distributed platforms.
· Both business users and Information Technology (IT) staff are responsible for data accuracy and consistency. Policies and procedures must be established to ensure the accuracy of data.
· IT staff is responsible for and must provide security mechanisms to safeguard all data under IT control. The business users must determine functional security requirements, while the physical security must be provided by IT.
· Applied systems management provides safeguards against data loss and corruption and provides the means of recovering data after system failures. This implies that effective backup and recovery systems are imperative and that data can be recovered on a timely basis regardless of the cause of loss.
· For critical functions, plan for survivability under both normal operations and degraded operations.

Data Security
Principle: 9 Record information about users and their connections as they update and delete data. Auditing can determine who updated a record and their connection data.
The information that can be captured by the application includes:
· The user account the user logged in with.
· The TCP/IP address of the connected user's workstation.
· The certificate information (if using certificates) about that user.
· The old values that were stored in the record(s) before the modification.
· The new values that were input to the record(s).

Data Security
Principle: 10 Protect database servers from hardware failures and physical and OS attacks.
· Database servers must be located in a climate-controlled, restricted-access facility, and preferably a fully staffed data center. Uninterruptible power supplies (UPSs), redundant disks, fans, and power supplies must be used.

Data Warehouse
Principle: 11 Perform benchmarks on the database design before constructing the database.
· Expect to make changes and adjustments throughout development.
· Changes during the early cycles, up to and including implementation, are a primary mechanism of performance tuning.

Data Hygiene Tools
Principle: 12 Ensure data entry quality is built into new and existing application systems to reduce the risk of inaccurate or misleading data in OLTP systems and to reduce the need for data hygiene.
· Provide well-designed data-entry services that are easy to use (e.g., a GUI front end with selection lists for standard data elements like text descriptions, product numbers, etc.).
· The services should also restrict the values of common elements to conform to data hygiene rules.
· The system should be designed to reject invalid data elements and to assist the end user in correcting the entry.
· All updates to an authoritative source OLTP database should occur using the business rules that own the data, not by direct access to the database.
· Attention to detail should be recognized and rewarded.

Data Warehouse FAQ (Draft)

1. What is a data warehouse?

A data warehouse is defined as the data, software, hardware, policies, processes, tasks and documents that comprise:
· The Data warehouse database
o Different disks
o Different data structures
o Different hardware architecture
o Ease of use
o Cleaned data
o Integrated data from many systems
o Longer, non-volatile storage
o Security
· Business Analysis
· ETL
· Derived data and Aggregations
· The data Cleaning Mechanisms
· Data Quality Assurance Processes
· Data Access mechanisms including data mining tools
· Operations, management and maintenance
· Data Extracts & other interfaces for other systems

2. Why do you build a data warehouse?
The need for data warehousing is mainly caused by the limitations of transaction processing systems.
· Different disks: Separate querying and reporting onto different systems / disks from the transaction processing systems.
· Different data structures: Structure your data in the database or on disk in such a way that querying and reporting are fast, even though such a structure is not appropriate for transaction processing.
· Different hardware architecture: Use a different hardware architecture than is needed for transaction processing.
· Ease of use: To provide an environment where a relatively small amount of knowledge of the technical aspects of database technology is required to write and maintain queries and reports and/or to provide a means to speed up the writing and maintaining of queries and reports by technical personnel.
· Cleaned data: To provide a repository of "cleaned up" transaction processing systems data that can be reported against and that does not necessarily require fixing the transaction processing systems
· Integrated data from many systems: To make it easier, on a regular basis, to query and report data from multiple transaction processing systems and/or from external data sources and/or from data that must be stored for query/report purposes only
· Longer storage: To provide a repository of transaction processing system data that contains data from a longer span of time than can efficiently be held in a transaction processing system and/or to be able to generate reports "as was" as of a previous point in time
· Security: To prevent persons who only need to query and report transaction processing system data from having any access whatsoever to transaction processing system databases and logic used to maintain those databases
3. What are the differences between OLAP and OLTP systems / processes?

OLTP: On-Line Transaction Processing
–Many short transactions (queries + updates)
–Examples:
•Update account balance
•Enroll in course
•Add book to shopping cart
–Queries touch small amounts of data (one record or a few records) and are predictable.
–Updates are frequent
–Concurrency is the biggest performance concern
OLAP: On-Line Analytical Processing
–Long transactions, complex queries
–Examples:
•Report total sales for each department in each month
•Identify top-selling books
•Count classes with fewer than 10 students
–Queries touch large amounts of data and are unpredictable.
–Updates are infrequent – almost non-existent
–Individual queries can require lots of resources
4. OLAP, Data warehouse, Data Mart, Dimensional Database, Star Schema, Cubes are sometimes used interchangeably. Sort out the differences.
OLAP: Online analytical processing. OLAP is a method of data processing supporting analytical requirements, as opposed to the transaction requirements supported by OLTP. It encompasses all of the above.
Data warehouse: A data warehouse is a specific way of supporting OLAP requirements.
Data Mart: A data mart is a specific way of supporting OLAP requirements. It is similar to Data Warehouse but focused to a particular subject / set of questions.
Dimensional Database: A database whose structure follows the dimensional model - fact and dimension tables - whether implemented relationally (star schema) or as a multidimensional store (cubes).
Star Schema: Describes the structure of the Data Warehouse / Data Mart data in a relational database. It is a relational implementation of a dimensional model.

5. What is a dimensional model?
A form of data modeling that facilitates OLAP-style analytical query building. The model is very simple compared with the traditional ER model because the access paths are unpredictable. Dimensional models have two kinds of entities – one or a few fact tables (which hold the business measures) and many dimension tables (which hold the variables that describe the business and by which we can analyze the measures). The dimension tables are completely orthogonal – i.e., dimension tables join to fact table(s) but not to each other. In this form of modeling, the emphasis is not on normalizing the data but on ease of query building.
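A minimal sketch of what a query against such a model looks like (all names hypothetical): dimensions join only to the fact table, and a typical query constrains and groups by dimension attributes while aggregating measures.

    SELECT d.calendar_month,
           c.customer_state,
           SUM(f.sale_amount) AS total_sales
    FROM   Fact_Sales f
    JOIN   Dim_Date     d ON f.date_key     = d.date_key
    JOIN   Dim_Customer c ON f.customer_key = c.customer_key
    GROUP BY d.calendar_month, c.customer_state;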
6. What is a dimension?
Dimensional data models are most common for data warehouses. The model is very simple compared with the traditional ER model because the access paths are unpredictable. Dimensional models have two kinds of entities – one or a few fact tables (which hold the business measures) and many dimension tables (which hold the variables that describe the business and by which we can analyze the measures). The dimension tables are completely orthogonal – i.e., dimension tables join to fact table(s) but not to each other. Dimensions (dimension tables) can contain hierarchies of variables. Each value of a dimension is represented by a row in the dimension table. The row holds a surrogate key, the natural key / application data source key, attributes representing the value, and metadata.
7. What is a dimension hierarchy?
Dimensional models have two kinds of entities – one or a few fact tables (which hold the business measures) and many dimension tables (which hold the variables that describe the business and by which we can analyze the measures). A single variable can be represented at different levels – e.g., a date dimension can be represented by day, week, month, quarter, year, etc. – and the levels fully overlap the finer representations, i.e., a month encompasses day values. These are dimension hierarchies. Some dimensions have multiple hierarchies based on them. (A ragged dimension is one where certain levels are missing for certain values of the dimension.) Dimension hierarchies facilitate rollup, drill down, slicing and dicing.
In a customer dimension, data at the Customers level is aggregated into the Cities level, which, in turn, is aggregated into the Countries/Areas, Continents/Regions, and Global levels.
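As a hedged sketch of rolling facts up such a hierarchy (hypothetical names; the levels are simply columns of the one dimension table):

    -- Roll customer-level facts up the geography hierarchy: city -> country -> continent
    SELECT c.continent, c.country, c.city,
           SUM(f.sale_amount) AS sales
    FROM   Fact_Sales f
    JOIN   Dim_Customer c ON f.customer_key = c.customer_key
    GROUP BY c.continent, c.country, c.city;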
8. What is a fact?
Dimensional data models are most common for data warehouses. The model is very simple compared with the traditional ER model because the access paths are unpredictable. Dimensional models have two kinds of entities – one or a few fact tables (which hold the business measures) and many dimension tables (which hold the variables that describe the business and by which we can analyze the measures). Facts are the business measures held in the fact tables. Facts are used to analyze the business for KPIs, opportunity identification, informed decision making, management decisions, and prediction (e.g. trending).

9. What are different types of dimensions? Explain? (or What is SCD / Slowly Changing Dimensions, Type 1 dimensions, Type 2 dimensions)
One way to categorize dimensions is by the way changes to the values of the dimension variables are handled in the data warehouse. If the changed values overwrite the current values, it is a Type 1 dimension. If the history of the value is maintained by using a current key / flag and / or a date range during which the dimension row is valid, it is called a Type 2 dimension. A very infrequently used Type 3 dimension adds fields to the same record to store newer values. (Infrequent since this is hard to do in relational databases.)
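A minimal Type 2 sketch in generic SQL (the current_flag, effective_date and expiry_date columns and all values are hypothetical): a change closes the old row and inserts a new one rather than overwriting it.

    -- Expire the current version of the changed customer
    UPDATE Dim_Customer
    SET    current_flag = 'N',
           expiry_date  = CURRENT_DATE
    WHERE  customer_id  = '12345'
      AND  current_flag = 'Y';

    -- Insert the new version with a new surrogate key
    INSERT INTO Dim_Customer
        (customer_key, customer_id, customer_state, current_flag, effective_date, expiry_date)
    VALUES
        (98765, '12345', 'NJ', 'Y', CURRENT_DATE, DATE '9999-12-31');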

10. What is a Surrogate key?
According to the Webster’s Unabridged Dictionary, a surrogate is an "artificial or synthetic product that is used as a substitute for a natural product." A surrogate key in a data warehouse is an artificial or synthetic key that is used as a substitute for a natural key.
Why use a surrogate key?
a. Multiple systems can have the same value of the natural key representing different values of a dimension variable.
b. Source system values may change.
c. The source system may not have a well-defined / accurate natural / primary key.
d. For performance reasons, as the natural key / source system primary key may not be an integer.
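A hedged sketch of how the surrogate key decouples the warehouse from the source system during a fact load (all names hypothetical): the natural key is used only to look up the warehouse-assigned integer key.

    INSERT INTO Fact_Sales (date_key, customer_key, sale_amount)
    SELECT d.date_key,
           c.customer_key,            -- warehouse-assigned surrogate, not the source key
           s.sale_amount
    FROM   Stage_Sales s
    JOIN   Dim_Date     d ON d.calendar_date = s.sale_date
    JOIN   Dim_Customer c ON c.customer_id   = s.source_customer_id  -- natural key used only for lookup
                         AND c.current_flag  = 'Y';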
11. What is metadata?
Metadata is “data about data”. Warehouse metadata describes the warehouse data and the processes used in creating the warehouse. Metadata is the key to understanding the warehouse. It helps you to locate, manage, and use information.

Business Metadata: Business metadata consists of business concepts (such as organizational structure or the business resources involved) that decide how the data is captured or how the information is presented. It also includes information about the business processes supported by the data and their contacts. Business metadata should be associated with technical metadata.

Technical Metadata: Technical metadata includes architectural or developmental information about the location, formats and meanings of the data as it is processed and stored in the various modules of the data warehouse and in reports.

Operational Metadata: Run Statistics, Data Quality statistics, Job History – the information about the actual operations of the data warehouse is the operational metadata.


12. What is Natural key / application data source key?

13. What is star schema?

14. What is snowflake schema?
15. What is degenerate dimension
16. What is factless fact
17. What are the different ways of loading fact tables? Explain (snapshot, drift).
18. What is a stage? Why do you need a stage?
19. Tell me about cubes
20. Full process or incremental
21. Are you good with data cleansing?
22. How do you handle changing dimensions?
23. Talk about the Kimball vs. Inmon approaches.
24. Talk about the concepts of ODS and information factory.
25. Talk about challenges of real-time load processing vs. batch.
26. Know the difference between Logical and Physical models.
27. Know how to use the Reverse Engineer and Comparison features.
28. The dimension model feature is pretty weak, but you might want to know how Erwin treats dimensional modeling.
29. Source target mapping
30. What is source qualifier?
31. Difference between DSS & OLTP?
32. Explain grouped cross tab?
33. Hierarchy of DWH?
34. How many repositories can we create in Informatica?
35. What is surrogate key?
36. What is difference between Mapplet and reusable transformation?
37. What is aggregate awareness?
38. Explain reference cursor?
39. What are parallel queries and query hints?
40. DWH architecture?
41. What are cursors?
42. Advantages of denormalized data?
43. What is operational data source (ODS)?
44. What is meta data and system catalog?
45. What is factless fact schema?
46. What is a conformed dimension?
47. What is the capacity of power cube?
48. Difference between PowerPlay transformer and power play reports?
49. What is IQD file?
50. What is Cognos script editor?
51. What is the difference between macros and prompts?
52. What is power play plug in?
53. Which kind of index is preferred in DWH?
54. What is hash partition?
55. What is DTM session?
56. How can you define a transformation? What are different types of transformations in Informatica?
57. What is mapplet?
58. What is query panel?
59. What is a look up function? What is the default transformation for the look up function?
60. What is the difference between a connected look up and an unconnected look up?
61. What is staging area?
62. What is data merging, data cleansing and sampling?
63. What is update strategy and what are the options for update strategy?
64. OLAP architecture?
65. What is subject area?
66. Why do we use DSS database for OLAP tools?

67. Business Objects Popular Q&A:
68. What is a universe?
69. Analysis in business objects?
70. Who launches the supervisor product in BO for first time?
71. How can you check the universe?
72. What are universe parameters?
73. Types of universes in business objects?
74. What is security domain in BO?
75. Where will you find the address of repository in BO?
76. What is broad cast agent?
77. In BO 4.1 version what is the alternative name for broadcast agent?
78. What services the broadcast agent offers on the server side?
79. How can you access your repository with different user profiles?
80. How many built-in objects are created in BO repository?
81. What are alerters in BO?
82. What are different types of saving options in web intelligence?
83. What is batch processing in BO?
84. How can you get your first report in BO by using broadcast agent?
85. Can we take report on Excel in BO?
86. What is a KPI?

Bubble Diagram of Architecture


Rules for creating Dimensions

  • Dimensions are entities - not attributes. So a dimension called tick_type is silly.
  • Dimensions don’t talk to other Dimensions.
  • Every attribute of the dimension belongs to that entity only.
  • If there needs to be a link between two Dimensions, create a bridge table (see the sketch below).
  • If there are too many bridge tables, combine them into a factless fact table.
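A minimal sketch of the bridge-table rule (hypothetical names; it assumes Dim_Customer and Dim_Account already exist): two dimensions related many-to-many are linked through a bridge rather than joined directly to each other.

    CREATE TABLE Bridge_Customer_Account (
        customer_key INT NOT NULL REFERENCES Dim_Customer (customer_key),
        account_key  INT NOT NULL REFERENCES Dim_Account  (account_key),
        PRIMARY KEY (customer_key, account_key)
    );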

What is Data Sense

I am often asked, usually by programmers: What is Data Warehousing and how do I learn it? I explain that we use all the same tools that you do, but differently. How differently? That’s when I coined the term Data Sense. It describes the essence of Data Warehousing and separates Data Warehousing from the rest of programming.

Resume

Vijaya Kumar Nadendla PhD
51 Linwood Dr, Monroe, NJ
609 655 3351
609 851 6795 (C)
vnadendla@gmail.com

Objective: Looking for a contract Data Warehouse Architect / Manager position. Comfortable in taking over current (troubled / successful) projects or in starting brand new initiatives.

Professional Summary:

About fifteen years of experience building 16 data warehouses and in quantitative data analysis.

Hands on in design and development with / without ETL tools (Informatica, Ab Initio, Business Objects, SSIS, SSAS, Proclarity) at expert developer level without being a tool jockey.

Developed a big picture view of the data warehouse encompassing purpose and vision, data, Data model design, database design, ETL, impact on design of infrastructure, life cycle, expectations management.

Build and manage an accomplished team to build a data warehouse from concept to deployment and beyond. Able to handle onshore / offshore teams and additional management responsibilities outside data warehousing (Electronics development, web application development etc)

Interface with executive sponsors and serve as the primary point of contact to establish project plans, work breakdown structures, milestones, deliverables, status reports and risk assessments.

Provide hands on leadership and direction to the development of a Data management and business intelligence competency including architecture and standards at multi terabyte scale. Establish a roadmap for the above.

Solid mathematical and statistical data analysis and model building background (in experimental high energy physics, analyzing billions of events). Developed an understanding of financial market data, macroeconomic variables, derivatives, etc. to help properly design and manage financial data.

Education: Ph.D. in Physics, LSU, Baton Rouge, Louisiana, Aug 1996
Experience:

BI Consultant, Bloomberg, New York, NY Nov 06 – Current
· Redesigned and developed a multi-terabyte-scale data warehouse analyzing user behavior on Bloomberg terminals using SSIS and SSAS.
· Advocated, mentored and educated the various groups in the company responsible for the data warehouse, from source application developers to hardware and network engineers, on how the data warehouse affects / influences their designs.
· Developed analyses / reports using ProClarity.
· Developed an internal tutorial / introductory document to train the rest of the group in MDX.
· Developed and presented BI strategy for upper management.
Business: Finance; Environment: SSIS, SSAS, Proclarity, MDX, SQL Server
Sr Data Architect, American Stock Exchange, New York, NY Sept 06 – Nov 06
Brought to production a very messy data warehouse storing hybrid online / floor trading data for regulatory purposes.
The data warehouse is over 20 terabytes in size. For legacy reasons this data warehouse has over 150 dimensions and about 50 fact tables. A proper redesign is not an option.
Devised and developed solutions to the manpower-intensive problems and innumerable bugs the above design implied.
Devised an ETL process that handled the above without a staging environment.
Business: Stock Market / Regulatory; Environment: Informatica, Oracle 10g, Linux

Founder / Architect (Part time), GFMS, Plainsboro,NJ Dec 2005 – June 2006
· Provide design, architecture and expert advice to a start up company developing a financial data warehouse and Data Mining product integrating financial market (Stocks, derivatives, bonds) data at micro (ticks, trades and orders) and macro (indexes, business cycle indicators) economic levels from numerous public and private data sources.
Provide the data warehouse data model in a star schema to analyze and store the market metrics and easily calculate financial models. These include calculations and storage of historical variance-covariance metrics, calculations of implied volatility, EWMA and GARCH models of volatility, market impact of a trade, the Greeks, the theoretical Black-Scholes value of hypothetical options, etc.
Business: Finance; Environment: Informatica, Oracle 10g, Windows

Data Warehouse Architect, IDT, Newark, NJ Jan 05 – Sept 2006
· Plan and implement the strategy & methodology to replace the current ad hoc stovepipe data marts and databases in the company with an 8-terabyte enterprise-wide data warehouse.
· Responsible for the creation and enforcement of enterprise data standards within the organization.
· Mandate includes team creation, software evaluation and purchase, and mentoring and advocating data warehousing and data architecture within the company.
Took the company from RFI and POCs (Informatica, Data Stage and Ab Initio) to tool selection and company wide implementation. Responsible for the architecture for data management using Ab Initio.
Business: Telecom, Accounting, Customer Service; Environment: Ab Initio, Oracle 10g, Linux

Independent Contract (Part Time), DevLogics, 300 Park Avenue, New York, Aug 2005 – Dec 2005
· As an expert consultant, make a proposal for the plan, strategy and architecture to acquire, manage and analyze data for a National Political Party in preparation for the presidential elections of 2008.

BI Architect, WebMD, New York, NY (5/02 – Jan 05)
· Responsible for the technical and data architecture in the development of a 2-terabyte sized Data Warehouse and the data analysis framework using SQL Server and analysis services.
Developed logical and physical data models for the integration of data from disparate corporate environments. Developed multidimensional data structures for efficient data access.
· Statistical analysis of the web usage trends for profitability, risk and revenue prediction.
· Responsible for the creation and enforcement of enterprise data standards within the organization.
Set up technical and operational production processes.
Business: Health Care / Web; Environment: DTS, SQL Server, Windows, Business Objects

Independent Contractor; Harmonic Corp, PetSmart – Business Intelligence (8/00 – 05/02)
· Responsible for the market basket analysis and predictive modeling.
· Provided architecture and technical support for the construction of a few terabyte-sized data repositories on Oracle and SQL Server.
· Business: Advertising, Accounting, Web Analytics, Retail; Environment: Informatica, Oracle, Unix, Business Objects

Data Mining Engineer / Application group Manager, Vasona Systems, Campbell, USA. (1/98-8/00)
· Built a 100-gigabyte health care Data Warehouse and data analysis system.
· Provided fast near real-time access to extensive variance tracking and analysis, utilization management trends, quality and medical outcomes.
· Created and led a team of engineers for the Data Warehouse development.
· Used segmentation and decision trees for the healthcare fraud detection.
· Business: Health Care, Web Analytics; Environment: Informatica, Oracle, Unix, Business Objects
(Company Closed)

Advisory Systems Engineer – IBM, Dallas, TX. (12/96 – 1/98)
· Built five data repositories using IBM’s tools for their clients.
· Responsible for the support of IBM’s Data Warehouse and campaign management tools.
· Business: Banking; Environment: DB2, Unix, Business Objects
(Company wide Layoffs)

DESY, Hamburg, Germany. (6/ 93 – 8/ 95)
· Responsible for managing the data acquisition systems of a billion dollar project.

Data Warehouse Initiation Form

Aparna Chamerthi & Vijay K Nadendla

Data Warehouse Name:

Executive Sponsor:


Project Vision & Scope:


Project Priority:


Required Response Time - Order of Magnitude (1 sec, 1hour):


Resources:


Software Needed:


Top-level Requirements:



Subject Matter Experts:


Users:


User Contact:


Data Sources:


Data Sources Contacts:


Data Sources Documentation:


Financing:

ETL Template

Aparna Chamerthi & Vijay K Nadendla
Extract Data Flow:

[Flow diagram: Data Sources and File Sources feed the Data Server; data is extracted and loaded into the Stage, transformed, and moved into the Data Warehouse, where aggregates are built; extracts and further aggregates populate the Data Marts.]

· BusinessName_ETL 0
o Truncate Stage Fact table
· BusinessName_ETL 1
o Copy Dimensions & Lookups from Data Warehouse to stage
[Loop start for Real time data warehouses. Steps outside loop done once a day]
o Load input data from data sources and file sources to Stage one day (or period) at a time
· BusinessName_ETL 2
o Update the Dimension tables on stage with lookups / dimension data from data / file sources
o Find dimension keys and update the stage fact table (fat table). If a certain key is not present in the dimension table but a corresponding transaction is present in the fact table, insert a skeleton record into the dimension table (see the sketch after this list).
o Calculate and Populate aggregate tables.
o Make and verify data audit reports
· BusinessName_ETL 3
o Copy back dimensions to Data Warehouse
o Copy back facts and aggregates to Data Warehouse
[Loop end for Real time data warehouses. Steps outside loop done once a day]
o Archive Stage
o Make and verify Data warehouse Audit Reports
· BusinessName_ETL 4
o Make aggregates on data warehouse
o Extract and aggregate to dependent Data marts
o Run further aggregates on dependent Data marts
o Populate Summary / Report tables on Data Warehouses and Data Marts
o Run the Data Extracts from the data warehouse.
· Data Warehouse Live for querying
· Back up all Data Warehouses and Data Marts (once a week)
· Clean up old data in Data Warehouses and Data Marts (periodically)
· Clean up Data Server (Periodically)
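A minimal SQL sketch of the skeleton-record step referenced in BusinessName_ETL 2 above (table and column names are hypothetical, and the surrogate key is assumed to be auto-generated by the database):

    -- Insert placeholder dimension rows for natural keys that appear in the stage
    -- fact data but are missing from the dimension; real attributes arrive later.
    INSERT INTO Stage_Dim_Customer (customer_id, customer_name, current_flag)
    SELECT DISTINCT f.source_customer_id,
           'UNKNOWN',                 -- placeholder attribute values
           'Y'
    FROM   Stage_Fact_Sales f
    LEFT JOIN Stage_Dim_Customer c ON c.customer_id = f.source_customer_id
    WHERE  c.customer_id IS NULL;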

Data Warehouse Error Taxonomy

Aparna Chamerthi & Vijay K Nadendla
The code performing any changes to the source data is tagged with the error type that it fixes. This is part of the ETL metadata.
· Incomplete errors
o Missing records
o Missing fields
o Records or fields that, by design, are not being recorded
· Incorrect errors
o Wrong (but sometimes right) codes
o Wrong calculations, aggregations
o Duplicate records
o Wrong information entered into source system
o Incorrect pairing of codes
· Incomprehensibility errors
o Multiple fields within one field
o Weird formatting to conserve disk space
o Unknown codes
o Spreadsheets and word processing files
o Many-to-many relationships and hierarchical files that allow multiple parents
· Inconsistency errors
o Inconsistent use of different codes
o Inconsistent meaning of a code
o Overlapping codes
o Different codes with the same meaning
o Inconsistent names and addresses.
o Inconsistent business rules
o Inconsistent aggregating
o Inconsistent grain of the most atomic information
o Inconsistent timing
o Inconsistent use of an attribute
o Inconsistent date cut-offs
o Inconsistent use of nulls, spaces, empty values, etc.
o Lack of referential integrity
o Out of synch fact data

Data Warehouse Design Standards

Aparna Chamerthi & Vijay K Nadendla

File Sources: The source pushes its file to the Data Collector. If it pushes the file somewhere else, it is moved by Ab Initio to the Data Collector.

Data Sources: Tables are copied whole. Larger tables have the incremental data copied if possible.

Data Collector: A database for dimensional / small tables and MFS for fact data extracted from Data Sources / File Sources. Minimal transformations. One Data Collector serves all data warehouses. Processes into the Data Collector are included in the Ab Initio Data Collector project.

Stage: All data is integrated & transformed only here. A single period's data is usually processed. It will contain structures similar to the data sources, the final dimensional model similar to the data warehouse, and intermediate tables if necessary. All dimension / lookup tables are copied daily from the data warehouse and reconciled with the source systems. There is a different stage per DW. Processes in and out of the stage db are included in the Ab Initio Stage project.

ODS: An Operational Data Store for quick short-term reporting. Not much processing is involved. The Ab Initio ODS project will include all processes coming into the ODS db.

Data Warehouse: The data is stored at the lowest incoming grain and in dimensional format. It contains the master tables for dimensions and lookups. Extracts and some aggregates are done on the data warehouse. Processes out (and out and in) of the Data Warehouse are part of the DW Ab Initio project.

Data Mart: Subject / question oriented small data repositories. Processes in and out of Data Mart are part of Data Mart Project.

ETL[AX] = Extract, Transform, Load [Aggregate, eXtract]. This is a single process that runs daily at a certain time (and for a certain time) and in that order. Processes cannot be mixed. e.g. All Extracts of a data warehouse must occur before transforms occur.

Slowly Changing Dimensions: Lastupdatetime is added to all the dimensions. If any dimension record changes, a new record is inserted with the new values. The old record stays as is.

Conformed Dimensions: IDT’s conformed dimensions are D_Date, D_Customer, D_Access_Point …

Lookup Tables: DW Group maintains the lookup tables and their values.

Facts: Data from multiple sources is integrated into a single fact table. Usually there is only one fact table. Completely independent data from different events, with non-overlapping dimensions and almost no occasion for joins between fact tables, can be separated out, resulting in a multi-star schema. E.g. sales transactions and CDRs.

Aggregates: Aggregates are multidimensional. They are not for a single report.

Summary Tables: May be built in the users' schema.

The Case for Data Warehouse

DW Group - Version 1.0

The need for data warehousing is mainly caused by the limitations of transaction processing systems. These limitations of transaction processing systems are not, however, inherent. That is, the limitations will not be in every implementation of a transaction processing system. Also, the limitations of transaction processing systems will vary in how crippling they are.
· To perform server/disk bound tasks associated with querying and reporting on servers/disks not used by transaction processing systems
· To use data models and/or server technologies that speed up querying and reporting and that are not appropriate for transaction processing
· To provide an environment where a relatively small amount of knowledge of the technical aspects of database technology is required to write and maintain queries and reports and/or to provide a means to speed up the writing and maintaining of queries and reports by technical personnel
· To provide a repository of "cleaned up" transaction processing systems data that can be reported against and that does not necessarily require fixing the transaction processing systems
· To make it easier, on a regular basis, to query and report data from multiple transaction processing systems and/or from external data sources and/or from data that must be stored for query/report purposes only
· To provide a repository of transaction processing system data that contains data from a longer span of time than can efficiently be held in a transaction processing system and/or to be able to generate reports "as was" as of a previous point in time
· To prevent persons who only need to query and report transaction processing system data from having any access whatsoever to transaction processing system databases and logic used to maintain those databases
A firm that expects to get business intelligence, better decision making, closeness to its customers, and competitive advantage simply by plopping down a data warehouse is in for a surprise. Obtaining these next-order benefits requires firms to figure out, usually by trial and error, how to change business practices to best use the data warehouse, and then actually to change those practices. And that can be harder than implementing a data warehouse.

SOW – National Political party

This I would say is a high point for me. But after considerable introductory & strategy sessions the manager there left! And everything started all over again with new people.

Project Description
General Description of Services
The NPP envisions a Voter Database system to support upcoming elections. The NPP is seeking to hire XXX to serve in a technical advisory role for approximately two months to assist in planning for the implementation of the NPP's Voter Database. XXX will provide infrastructure, database and developer subject matter expertise and support for the NPP’s Voter Database Project. The XXX team will work closely with the NPP team to develop the data model and technical architecture required for the Voter Database Project.
…………………………..
…………………………..
…………………………..

Detailed Description of Project
XXX will assist, on a time & materials basis, in the data modeling project by working with the NPP and required state party staff to create a data model for the voter database. XXX will work with the NPP to understand the various use-case scenarios of any eventual applications and to determine exactly what types of data would be stored within the system.

Additionally, XXX will work with the NPP to develop the architecture for the technical infrastructure required for the Voter Database. This includes: assistance in selecting a RDBMS, guidance on the right solution for database and application high-availability, advice on purchasing hardware and assistance implementing this infrastructure.
Consulting Services
a) XXX will provide Client with the following services (collectively, the "Consulting Services"):

· Development of a Data Model for the voter database - This will include a diagram (Visio or other), Word Document describing the Data Model elements.

· Development of the Technical Architecture - This will include a diagram (Visio or other), Word Document describing the architecture recommendations.

· Development of a High Level Project Roadmap to describe versions, phase mapping and associated budgets.

· Development of budgets for complete solution (hardware, software and services) based on recommendations by Phase.

b) XXX will provide consultants with the requisite skills necessary to assist Client with its needs:

One Senior Technical Director, Business Intelligence
……………………………………………………..
Deliverables
a) List of Deliverables. XXX will provide the Consulting Services previously described to develop and deliver to Client the following Document Deliverables:
1. Data Model Document. This document will include a diagram (Visio or other), Word Document describing the Data Model elements.

2. Technical Architecture Document - This document will include a diagram (Visio or other), Word Document describing the architecture recommendations.

3. High Level Project Roadmap Document – This document will describe versions, phase mapping and associated budgets.

4. High Level Budget Document – this document will describe the budgets for complete solution (hardware, software and services) based on recommendations by Phase.

b) Requirements for Deliverables.
1. Data Model Document. This document must describe the Data Model elements.
2. Technical Architecture Document - This document must describe the architecture recommendations.
3. High Level Project Roadmap Document – This document must describe versions, phase mapping and associated budgets.
4. High Level Budget Document – this document must describe the budgets for complete solution (hardware, software and services) based on recommendations by Phase.

Business Situations & Approaches : Examples

Customer: Billion $ Telephone Company Newark, NJ

Project Name: Enterprise Data Architecture and Data Warehouse

Business Situation:
Client wanted to replace ad hoc stove pipe data marts and application databases in the company with a more structured solution.
The current solution is manual and results in exploding resource requirements and dipping user satisfaction.
A massive one-time rewrite is ruled out as impractical.

Approach:

Design the ideal technical and data architecture and standards to be used throughout the company.
Select best-of-breed software packages, tools and applications as the company standard.
Rewrite and move applications and data marts one by one to the new architecture making sure they integrate with existing solution.
Selection of a mid-tier storage platform to provide a consolidated storage platform for all applications and data marts.

Solution:
· Linux grid with blade based architecture to help reduce footprint of server racks and provide flexibility, expandability and redundancy. Software tools selected that work in grid environment.
· EMC SAN for storage of more than 10 TB of data.
· After full implementation this project will cut the staffing requirements by over 50%.
· The turnaround time for new projects is down by two thirds.
· Users are more satisfied with the accuracy and the timeliness of the data.
· Company cut down by over 80% the money not billed and collected because of the faulty systems.


Customer: Harmonic Corporation, San Francisco, CA


Project Name: Common Advertising Platform
Business Situation:
A new advertising company was focusing on building a common advertising platform for multiple media, from TV to newspapers and the Internet.
Need an end-to-end solution to support the business with a new idea and requirements.
Need to integrate with the current software facilitating the advertising in various media and various vendors.
Approach:
Understand each of the existing business processes in various media and develop a business model for the common platform to support all corresponding tasks.
Understand interfaces to the existing solutions in various media that the common platform has to interface to.
Develop an object model and a relational data model to support the common business model designed above.
Develop a BI strategy to deal with multiple clients and multiple media.
Develop a technical and data infrastructure for BI and interface with common advertising platform’s infrastructure.

Solution:
This is a new application that initiated the company’s endeavors in this area.
This application was later sold as an independent solution.
Solution developed on Sun Solaris with EMC as storage.
Java was used to develop the applications on Oracle. Industry-best tools were selected for BI.


Customer: PetSmart, Phoenix, AZ

Project Name: CRM
Business Situation:
Company had flat sales over last few years.
Increase sales by using CRM tools.
Introduce Loyalty Cards
Company had no consistent idea of its customers.

Approach:
Did business analysis to discover nearly 50 different stores of customer information. Initiated an effort to standardize the customer information collection. Designed and implemented a customer database.
Short listed CRM vendor solutions for their fit against business requirements and IT Standard and Architectures.
Selected vendor based on Proof of Concept and total score within Vendor Selection matrix tool.
Design and implement a Loyalty Card program using the CRM tools implemented.

Solution:
The CRM solution was found to increase the sales in target market by over 10%.
Company’s standardized customer database was used for several other initiatives.
Oracle on HP UNIX was used as the database, with EMC as storage. Code was developed using PL/SQL.



Customer: WebMD, New York, NY

Project Name: BI Migration
Business Situation:
The client discovered that, with the dot-com crash, costs needed to be cut by over 80% to stay alive. Every system was migrated from the existing platform to a simpler, cheaper architecture with an entirely new group of engineers.
BI fell through the cracks. There was no solution two months before the scheduled live date (later postponed for unrelated reasons).
With little budget and no existing knowledge base to draw from, a crash effort was put in place to rewrite the old BI solution on the new platform.

Approach:
Technical infrastructure to be similar to rest of solution to take advantage of existing resources.
Quick business analysis by reverse engineering requirements from the existing solution.
Simple Data Architecture following the established principles in industry
Rapid application development using two dedicated resources and a number of short-term contractors as needed.
Move requirements where necessary to next phase of project.

Solution:
Solution was ready for launch with rest of the applications.
The team size was reduced by over 75 %.
The company saved over $1 MM per year in software and infrastructure costs. The performance increased by a factor of 4.
The solution was developed on SQL Server on Windows 2000. The hardware storage solution is home-grown using a RAID disk array.

The Data warehouse at Web MD

Every data warehouse is a different experience and a learning experience. Working with a vendor-supplied application (SAP) or stable in-house application packages as data sources is a different experience from working with several short- and long-term, constantly changing data sources. Working with WebMD has expanded the domain of data warehousing for me.

Data warehousing is usually described in terms of data sources, the extraction of data from those data sources, their transformation and integration into a logically consistent data set, and then the load into the database and the merge with already existing data. Easy access is provided to this huge data set, something operational systems cannot do at this scale. A data warehouse is the data, the metadata, the software, the hardware, the processes, the procedures, the operations and the policies that enable this.

WebMD is a health care information website with an advertising and sponsor based revenue model. The information they are looking for from the data warehouse is about user behavior on the site, e.g.: top pages viewed, top pages viewed after a certain page. They look both at aggregates and at very specific events. The primary data sources are the web logs. The secondary ones are several databases that collect data from web forms, e.g.: registration. The web logs are physically one data source. The format is one but the data structure is not. Logically, web logs can be looked at as a common-format interface to several disparate data sources. Each of the fields of a web log is just a collection of several different other fields, e.g. a query string may contain several dozen fields passed to us as name-value pairs. These logical data sources are constantly changing as the web site is ever evolving with added functionality or changing business partnerships. The set of use cases of a point-of-sale data source is known. The set of use cases of a web site is unknown and changing. The generalized analysis requirements of the former can be known. For the latter it is impossible. Our data warehouse is built to accommodate these fast changes. It is very flexible while retaining the ease, scale and access of the other data warehouses I have worked on.

Our extraction process consists of several data movement scripts, a C# parser which reformats the input data source and does some basic transformations, and a number of SQL-based stored procedures performing aggregations and transformations, managed by SQL Server's DTS (DTS is not strictly an ETL tool that automates code development; rather, it provides a management shell where you write your own code). Loading is managed by DTS. The database is SQL Server, which is not common for a 2-terabyte data warehouse, and that restricts us to being smart in our data model and in the way we manage the databases that make up the data warehouse. Data access is provided by the Business Objects tool. Business Objects also provides a layer of insulation between the analyses and the data model: we can change the data model without affecting the validity of the hundreds of analyses on it.

Over time our role and understanding of the data warehouse has expanded to include, in addition to requirements analysis: 1) understanding the business problem being solved at the application requirements level, to make sure that the application developers are developing applications and recording data in a way that, further down the line, the data is accurate and logically consistent at the data warehouse; 2) understanding the end-user analysis and analytics, and making sure that their analyses remain consistent when our data changes. The first role is that of an enterprise data architect and the second that of a data liaison with the analytics and user community.
I have worked on several full life cycle data warehouses before, from the conception of the data warehouse to its actual use in production. This data warehouse redefines the concept to include the data itself, from conception to use. Being part of a small team of four, and the senior-most member of the group, I have worked on almost every aspect of the data warehouse hands on, either in coding, in helping others code, or in checking their code.

Data Warehouse Migration (Draft)

Aparna Chamerthi & Vijaya K Nadendla

Introduction

Into the second decade since the arrival of data warehouses, we are realizing that we face the same problems as any other IT initiative, mainly related to the end of the product lifecycle. We have to gracefully retire old data warehouses and make the case all over again for new ones. We may have to migrate some of the old data into the new data warehouse. We may also have to migrate the reporting, OLAP or any other BI solutions on top of the data warehouse to new platforms.

In this article we will discuss when to migrate a data warehouse, the main problems faced in migrating a data warehouse, how to calculate the ROI for the migration, and the main technical and scope issues. We will also briefly touch on staffing issues and success criteria.

What is a Data warehouse?
Although formally around for just over a decade, similar initiatives existed for a couple of decades before that. As a new and evolving field, our understanding and expectations of it have changed over time. The lack of a strong theoretical underpinning for the data warehouse, as opposed to relational databases, has resulted in ad hoc approaches based on experience. Hence let us define the data warehouse based on what we expect it to do – it could be one or more of the following:
A repository of data
A source of reports on general performance of business.
A source of data for statistical analysis
A platform for standardization and integration of data
A source of data for CRM, 1-1 marketing, and target marketing.
A data store (the only one in the case of the web and some scientific fields)
Data structures designed for quick, understandable and easy access to data.

Defining it by what its components are gives us the following list:
General understanding, processes and documentation of
a) Various sources of data and their data access mechanisms.
b) Various uses of data
c) Logical and physical platform and procedures for the standardization of data
d) Hardware and software for the data storage, archival, backup and recovery.
e) Hardware and software for data access, including networking & reporting
f) Data quality audit processes
g) Operational processes
h) Metadata cataloging and delivery
i) Interfaces with CRM, Data mining and statistical analysis processes.
j) Interfaces with other data warehouses
k) Security processes and procedures
The above contains the elements of many data warehouses. A single data warehouse may contain only some of the above. Some of the above may not be precisely defined. The most common components in a data warehouse are c, d and e.

What is data warehouse migration?
We take data warehouse migration to mean any substantial change in c, d or e. This will cause changes in some or all of the other components of the data warehouse listed above. A change of hardware for data storage will not, by itself, be called a data warehouse migration.
What to migrate
Migrate the meaning of the data, not just the data. Migrate the content and the context.
Given the volatility of the data and the businesses, how long will the data be useful?
How do the applications look at the data?

When to Migrate
• Old system breakdown
• Substantial Data Source changes
• DW Simplification and Reorganization
• Cost reduction
• Technology changes
• Business Process Change
• Business Reorganization
• Performance
Problems
• Employee Reluctance
– Different skill sets
– Job Security
• Transition Period Costs
• Parallel development
• Organizational Inertia
• Downstream Systems
Migrate vs. Modify
• Modification is first preference
• Doesn’t always work
– Increases complexity
– May be impossible
Scope – increase or decrease
Technical Architecture & software reuse
Staffing
Success Criteria

Starting a Data Management Process

Aparna Chamerthi & Vijaya K Nadendla

My immediate step would be to do a data inventory, take a quick pass through the data acquisition, storage, management and access, and put minimum security, storage and archival policies and procedures in place to prevent disaster. However, it is likely that these procedures already exist.

The next step would be to start up the data management initiative to protect the corporate data assets. Data enables and records business activity and measures business processes. Data enables technology to provide applications to support the business and business processes. Data enables organizations to make informed business decisions. Any data management strategy has to start with understanding these different aspects. Each enterprise is different and one has to be flexible to adapt to different situations. In general, I would follow these steps, some of them concurrent:
o Identify the major business areas and their relationships with one another. Identify the major business objects within each business area, how these objects relate to one another, and how each object is virtually represented in terms of data.
o Identify the enterprise business processes contained within each business area and their relationships with one another. Define the major tasks of each business process and its relationship to other processes. Identify the individual pieces of data that are used in support of the execution of a business process and measures the process’ efficiency.
o Understand the customers of the company, their interactions with the company, and the business processes that support those interactions.
o Identify and understand the various applications & databases that support the business processes and the business data. Define what each piece of data represents, with which business areas/processes it is associated, the format in which the data is to be expressed, and how the data is named (tagged).
o Understand how data is used within the company to support business decisions ranging from accounting to marketing to CRM. Understand the detail and accuracy to which the data is needed.
o Understand (or get the user to articulate) the business needs in terms of business value. Understand the business drivers. Combine differing "visions" of the company into an integrated whole that conforms to the company's stated mission. Identify opportunities to support that mission, those needs and those practices through technology.
o Understand the existing technological and business standards and policies. Assess "best practices" among the company's peers and lead a discussion of which practices are appropriate to the company. Campaign for Data Management standards within the company including educational opportunities, talks, discussions, newsletters and emphasis by high-level management. Develop a reasonable consensus on the appropriate data management strategies, policies and practices both in business and in IT.
o Define the mid-term goal or ‘end state’ of the data management strategy, which can be different from the optimal state. (People should know what goal state we are trying to achieve and that it is achievable.) Understand business and technological change and incorporate that into the Data Warehouse strategy. Understand the risks and ROI and prioritize the tasks. Develop a road map to go from the existing data management state (which is the default state in the absence of a well-defined strategy) to the goal state, making allowance for changes along the way.

The Data in Warehouse

This summary is not available. Please click here to view the post.