DataSense

I am often asked, usually by programmers: what is Data Warehousing, and how do I learn it? I explain that we use all the same tools they do, but differently. That is when I coined the term Data Sense. It describes the essence of Data Warehousing and separates it from the rest of programming. Every aspect of IT, from hardware and software infrastructure to design, development and QA, is done with massive data flows in mind and with a need for data precision, accuracy and meaning.

Wednesday, February 21, 2007

BI in its own network domain?

A very significant part of the effort in BI initiatives goes into putting proper infrastructure in place to handle massive flows of data. Further, the security and other requirements are different. It is not uncommon to see BI hardware shoehorned into the current network and hardware setup, with CORP, DEV, QA and PROD, thus forcing BI to deal with security and other policies that are alien to it. For example, any tool installation requires corporate approval, handling and management, yet the corporate team doesn't have the skill set to install, maintain or manage the tool.

A way out is to create a BI domain and place a firewall around it, but let the policies inside the domain be more lax.

Monday, February 19, 2007

How many dimensions should I have in my Data Warehouse?

I can tell you how many you should not have: 25, 50, 100 or more, because I refuse to believe there are 100 significant entities by which you are analyzing your business. To understand this, let's start with the definition of a dimension.

A dimension is an ENTITY by which you analyze your business. Until about 6 months ago I believed a dimension was a variable by which you analyze your business. Then I happened to work on projects with more than 100 dimension tables, and I realized the scope for misunderstanding in that definition.

What is an entity, and what is a variable or an attribute? A customer is an entity and a customer name is an attribute. Customer State, Customer Zip and Customer Type are more attributes of the same entity. The dimension is Customer. You'll have 100 dimensions in your data warehouse only if you create a dimension for each attribute, e.g. Dim_Customer_Type or Dim_Tick_Type. Take your 100+ dimensions and make entities out of them. You'll end up with 10-15 entities or fewer. The name of the game is to get to the right level of abstraction.
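
To make the entity-versus-attribute point concrete, here is a minimal Python sketch. It assumes a hypothetical Dim_<Entity>_<Attribute> naming convention, and every table name other than Dim_Customer_Type is made up for illustration; a real warehouse would need a person, not a string split, to decide which entity an attribute belongs to.

```python
from collections import defaultdict

# Hypothetical attribute-level "dimension" tables, assuming a
# Dim_<Entity>_<Attribute> naming convention (an assumption for illustration).
attribute_dimensions = [
    "Dim_Customer_Name",
    "Dim_Customer_State",
    "Dim_Customer_Zip",
    "Dim_Customer_Type",
    "Dim_Product_Category",
    "Dim_Product_Brand",
]


def group_by_entity(table_names):
    """Collapse attribute-level tables into one dimension per entity."""
    entities = defaultdict(list)
    for name in table_names:
        _, entity, attribute = name.split("_", 2)
        entities["Dim_" + entity].append(attribute)
    return dict(entities)


entity_dimensions = group_by_entity(attribute_dimensions)
for dimension, attributes in entity_dimensions.items():
    print(dimension, "->", attributes)
```

Six attribute-level tables collapse into two entity dimensions here, which is the same abstraction step that takes 100+ tables down to 10-15.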

When you have 100+ dimensions it becomes impossible to develop, build and manage the system. There is no ROI. Your resources don't increase just because you need them. Your resources are what you can afford. Instead of beating your resources, whether people or machines, into the ground, build a properly designed, simple data warehouse that satisfies the requirements.

Sunday, February 18, 2007

What is a BIG data warehouse?

When is a data warehouse BIG? Is it the number of tables, say a couple of hundred dimension tables and a few dozen fact tables? That defeats the very purpose of data warehousing. A data warehouse must have as many dimensions as you want to analyze the business by, and no one thinks in a 150-dimension space. It must have as many fact tables as there are completely independent transactions, that is, transactions so different in grain and in location in dimension space that they cannot be put together.

Is it then a large fact table? Do you really want to store, and analyze on the fly, data at the lowest of low grains? The transaction should be stored at the grain you want to analyze the business by.

A data warehouse is BIG when it is BIG at its right size. Otherwise it's just a BIG mess.

Logic Dependency in Business

ENTERPRISE DATA Architecture Standards

Examples of Data Architecture standards to aid in standards identification. These are not proposals but rather a list of standards in use in other organizations.

Data Architecture
Principle 1: Design the enterprise Data Architecture so it increases and facilitates the sharing of data across the enterprise.
· Sharing of data greatly reduces data entry and maintenance efforts.
· Data sharing requires an established infrastructure for widespread data access. This includes integration with the Application, Componentware, Integration, Messaging, Network, and Platform Architectures.
· Consistent shared data definitions ensure data accuracy, integrity, and consistency.
· Data sharing reduces the overall resources required to maintain data across the enterprise.

Data Architecture
Principle 2: Create and maintain roles and responsibilities within the distributed enterprise Data Architecture to facilitate the management of data. This requires a working relationship between the business user organizations and information services (IS).
Business responsibilities are to:
· Provide accurate business definitions of data.
· Develop enterprise-wide business views of shared data.
· Provide business drivers to support centralized data administration.
· Make metadata available.
· Define security requirements for data.
IS responsibilities are to provide a robust technical infrastructure that includes:
· Open, accessible, and adaptable database management systems (DBMSs).
· Centralized data administration.
· Data replication facilities.
· Backup and recovery.
· Security.
· Database monitoring tools.
· Data quality monitoring tools.
· Application mechanisms for helping to ensure accurate data input.

Metadata
Principle 3: When designing or modifying a database, review the Metadata Repository for existing standard and proposed data elements before implementing a new database, to ensure data elements are defined according to Metadata Repository standards.
Design reviews are essential to ensure that shared firmwide data is defined consistently across all applications. Design reviews also determine whether data that already exists is consistently defined and not redundantly stored. Design reviews should document the following:
· Where is this application getting its data?
· What other applications are getting data from this application?
· Is data used by this application defined consistently with firmwide definitions? If not, is there a plan to define the data according to enterprise definitions?
A design review evaluates the data requirements of a project and identifies the following:
· A data requirement that can be solved by using an existing metadata element.
· Data not already identified as metadata, which must be proposed as an inter-agency or firmwide standard to the Metadata Element Review Team to become metadata.
· Access is available for application development projects to reference the metadata repository in order to actively research data requirements. Review the existing standard and proposed data elements in the metadata repository before implementing a new database to ensure data elements are defined according to standards.
· Key information about data is stored in the systems that are already implemented in the firm. If possible, evaluate existing systems to propose firmwide data elements.

Data Modeling
Principle 4: Take the Entity-Relationship (ER) model to the third normal form, then denormalize where necessary for performance.
· The third normal form is the most commonly recommended form for the ER model.
· In some cases, a denormalized database can perform faster, as there can be fewer joins or reduced access to multiple tables. This reduces both physical and logical input and output requirements.

Data Modeling
Principle 5: Restrict free-form data entry where possible.
· In the design phase, consider the values that may be input into a field. These values or domains should be normalized so that data is consistent across records or instances, for example by using consistent values for gender or address information.
· Use look-up tables and automate data entry for column or attribute domain values to restrict what is entered in a column.
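
As a rough illustration of restricting a column through a look-up table, here is a minimal sketch using Python's built-in sqlite3 module. The table names (state_lookup, customer) and the helper add_customer are hypothetical, and a foreign key is only one of several ways to enforce a domain.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Hypothetical look-up table holding the allowed domain values.
conn.execute("CREATE TABLE state_lookup (state_code TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO state_lookup VALUES (?)", [("NY",), ("CA",), ("TX",)])
conn.execute(
    """CREATE TABLE customer (
           customer_id INTEGER PRIMARY KEY,
           name TEXT NOT NULL,
           state_code TEXT NOT NULL REFERENCES state_lookup(state_code)
       )"""
)
conn.commit()


def add_customer(name, state_code):
    """Return True if the row was accepted, False if the value is outside the domain."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO customer (name, state_code) VALUES (?, ?)",
                (name, state_code),
            )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = add_customer("Ann", "NY")
rejected = add_customer("Bob", "New York")  # free-form variant is refused
print(accepted, rejected)
```

In a data entry screen the same look-up table would also feed the selection list, so the user never gets the chance to type the free-form variant.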

Data Access Implementation
Principle 6: Validate data at every practical level to ensure data quality and avoid unnecessary network traffic.
· Validation can be coded into multiple tiers of the n-tier architecture to ensure that only valid data is processed and sent across the network. For example, an invalid field entered in a data entry form can be corrected before data is written to the database.
· Data integrity verification rules should be used when possible.

Data Access Implementation
Principle 7: Design the data access infrastructure to support the transparency of the location and access of data by each application.
· This means designing an N-tier architecture where all data access is managed through a middle tier. This design makes it easy to relocate, restructure, or re-platform databases and back-end services with minimal disruption to the applications that use them. It is essential for adaptive systems.
· A client should not send SQL requests directly to a server. Instead of using SQL code, the client should communicate with the database through data access rules. The application receives a request from a client and sends a message to the data access rule. The data access rule sends an SQL call to the database. With this method, the client does not send SQL to the server; it sends a request for work.
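
The request-for-work flow described above can be sketched roughly as follows, using Python's built-in sqlite3 module as a stand-in back end. The class CustomerDataAccess, the function handle_request, the request format and the customer table are all hypothetical names chosen for illustration.

```python
import sqlite3


class CustomerDataAccess:
    """Middle-tier data access rule: the only place that holds SQL."""

    def __init__(self, connection):
        self._conn = connection

    def customers_in_state(self, state_code):
        rows = self._conn.execute(
            "SELECT name FROM customer WHERE state_code = ? ORDER BY name",
            (state_code,),
        )
        return [name for (name,) in rows]


def handle_request(data_access, request):
    """Application tier: turns a client's request for work into a data access call."""
    if request["action"] == "list_customers":
        return data_access.customers_in_state(request["state"])
    raise ValueError("unsupported action: " + request["action"])


# Hypothetical back end; the client never sees this schema or any SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (name TEXT, state_code TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [("Ann", "NY"), ("Bob", "CA"), ("Cy", "NY")],
)

# The client sends a request for work, not an SQL statement.
result = handle_request(
    CustomerDataAccess(conn), {"action": "list_customers", "state": "NY"}
)
print(result)
```

Because the SQL lives in one class, the table could be renamed or moved to another platform by changing CustomerDataAccess alone; the client's request stays the same.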

Data Access Implementation
Principle 8: For data quality management, implement tools, methods, processes and policies to provide high-level data accuracy and consistency across distributed platforms.
· Both business users and Information Technology (IT) staff are responsible for data accuracy and consistency. Policies and procedures must be established to ensure the accuracy of data.
· IT staff is responsible for and must provide security mechanisms to safeguard all data under IT control. The business users must determine functional security requirements, while the physical security must be provided by IT.
· Applied systems management provides safeguards against data loss and corruption and provides the means of recovering data after system failures. This implies that effective backup and recovery systems are imperative and that data can be recovered in a timely manner regardless of the cause of loss.
· For critical functions, plan for survivability under both normal operations and degraded operations.

Data Security
Principle 9: Record information about users and their connections as they update and delete data. Auditing can determine who updated a record and their connection data.
The information that can be captured by the application includes:
· The user account the user logged in with.
· The TCP/IP address of the connected user's workstation.
· The certificate information (if using certificates) about that user.
· The old values that were stored in the record(s) before the modification.
· The new values that were input to the record(s).
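
A minimal sketch of capturing that information at the application level is shown below. The AuditRecord fields mirror the list above; the names audited_update and audit_log, the in-memory list, and the timestamp field are assumptions for illustration, and the sketch covers updates only, not deletes.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class AuditRecord:
    user_account: str
    client_address: str
    certificate_subject: Optional[str]  # None when certificates are not in use
    old_values: dict
    new_values: dict
    # The timestamp is an addition of this sketch, not part of the list above.
    recorded_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


audit_log = []  # stand-in for a real audit table


def audited_update(record, changes, user_account, client_address, certificate_subject=None):
    """Apply changes to a record (a dict) and log the before and after values."""
    old_values = {key: record.get(key) for key in changes}
    record.update(changes)
    audit_log.append(
        AuditRecord(user_account, client_address, certificate_subject, old_values, dict(changes))
    )
    return record


customer = {"id": 7, "state": "NY"}
audited_update(customer, {"state": "CA"}, user_account="jdoe", client_address="10.0.0.5")
print(audit_log[0])
```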

Data Security
Principle 10: Protect database servers from hardware failures and physical OS attacks.
· Database servers must be located in a climate-controlled, restricted-access facility, and preferably a fully staffed data center. Uninterruptible power supplies (UPSs), redundant disks, fans, and power supplies must be used.

Data Warehouse
Principle 11: Perform benchmarks on the database design before constructing the database.
· Expect to make changes and adjustments throughout development.
· Changes during the early cycles, up to and including implementation, are a primary mechanism of performance tuning.

Data Hygiene Tools
Principle 12: Ensure data entry quality is built into new and existing application systems to reduce the risk of inaccurate or misleading data in OLTP systems and to reduce the need for data hygiene.
· Provide well-designed data-entry services that are easy to use (e.g., a GUI front end with selection lists for standard data elements like text descriptions, product numbers, etc.).
· The services should also restrict the values of common elements to conform to data hygiene rules.
· The system should be designed to reject invalid data elements and to assist the end user in correcting the entry.
· All updates to an authoritative source OLTP database should occur using the business rules that own the data, not by direct access to the database.
· Attention to detail should be recognized and rewarded.

Data Warehouse FAQ (Draft)

1. What is a data warehouse?

A data warehouse is defined here as the data, software, hardware, policies, processes, tasks and documents that comprise:
· The Data warehouse database
o Different disks
o Different data structures
o Different hardware architecture
o Ease of use
o Cleaned data
o Integrated data from many systems
o Longer non-volatile storage
o Security
· Business Analysis
· ETL
· Derived data and Aggregations
· The data Cleaning Mechanisms
· Data Quality Assurance Processes
· Data Access mechanisms including data mining tools
· Operations, management and maintenance
· Data Extracts & other interfaces for other systems

2. Why do you build a data warehouse?
The need for data warehousing is mainly caused by the limitations of transaction processing systems.
· Different disks: Run querying and reporting on systems / disks that are separate from the transaction processing systems.
· Different data structures: Structure your data in the database or on disk in such a way that querying and reporting is fast, even though such a structure is not appropriate for transaction processing.
· Different hardware architecture: Use a hardware architecture different from the one needed for transaction processing.
· Ease of use: To provide an environment where a relatively small amount of knowledge of the technical aspects of database technology is required to write and maintain queries and reports and/or to provide a means to speed up the writing and maintaining of queries and reports by technical personnel.
· Cleaned data: To provide a repository of "cleaned up" transaction processing systems data that can be reported against and that does not necessarily require fixing the transaction processing systems.
· Integrated data from many systems: To make it easier, on a regular basis, to query and report data from multiple transaction processing systems and/or from external data sources and/or from data that must be stored for query/report purposes only.
· Longer storage: To provide a repository of transaction processing system data that contains data from a longer span of time than can efficiently be held in a transaction processing system and/or to be able to generate reports "as was" as of a previous point in time.
· Security: To prevent persons who only need to query and report transaction processing system data from having any access whatsoever to transaction processing system databases and the logic used to maintain those databases.
3. What are the differences between OLAP and OLTP systems / processes?

OLTP: On-Line Transaction Processing
–Many short transactions (queries + updates)
–Examples:
•Update account balance
•Enroll in course
•Add book to shopping cart
–Queries touch small amounts of data (one record or a few records) and are predictable.
–Updates are frequent
–Concurrency is biggest performance concern
OLAP: On-Line Analytical Processing
–Long transactions, complex queries
–Examples:
•Report total sales for each department in each month
•Identify top-selling books
•Count classes with fewer than 10 students
–Queries touch large amounts of data and are unpredictable.
–Updates are infrequent – almost non-existent
–Individual queries can require lots of resources
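
To contrast the two workloads, here is a small sketch using Python's built-in sqlite3 module. The sales table and its rows are made up for illustration, and a real OLTP system and a real warehouse would of course be separate databases rather than one in-memory table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, department TEXT, month TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (department, month, amount) VALUES (?, ?, ?)",
    [
        ("Books", "2007-01", 10.0),
        ("Books", "2007-01", 15.0),
        ("Books", "2007-02", 20.0),
        ("Music", "2007-01", 5.0),
    ],
)

# OLTP style: a short, predictable transaction touching one row by key.
with conn:
    conn.execute("UPDATE sales SET amount = ? WHERE sale_id = ?", (12.0, 1))

# OLAP style: a read-only query that scans and aggregates many rows.
totals = conn.execute(
    "SELECT department, month, SUM(amount) FROM sales "
    "GROUP BY department, month ORDER BY department, month"
).fetchall()
print(totals)
```
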
4. OLAP, Data warehouse, Data Mart, Dimensional Database, Star Schema, Cubes are sometimes used interchangeably. Sort out the differences.
OLAP: Online analytical processing. OLAP is a method of data processing that supports analytical requirements, as opposed to the transaction requirements supported by OLTP. It encompasses all of the other terms above.
Data warehouse: A data warehouse is a specific way of supporting OLAP requirements.
Data Mart: A data mart is a specific way of supporting OLAP requirements. It is similar to a data warehouse but focused on a particular subject / set of questions.
Dimensional Database:
Star Schema: Describes the Structure of the Data Warehouse / Data Mart data in relational database. It is a relational implementation of a dimensional model.

5. What is a dimensional model?
A form of data modeling that facilitates OLAP-style analytical query building. The model is very simple compared with the traditional ER model because the access paths are unpredictable. Dimensional models have two kinds of entities: one or a few fact tables (which hold the business measures) and many dimension tables (which hold the variables that describe the business and by which we can analyze the measures). The dimension tables are completely orthogonal, i.e. dimension tables join to the fact table(s) but not to each other. In this form of modeling, the emphasis is not on normalizing the data but on ease of query building.
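
As a rough illustration of one fact table joined to orthogonal dimension tables, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (dim_customer, dim_date, fact_sales) and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, state TEXT);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, day TEXT, month TEXT);
    -- The fact table holds the measure plus one key per dimension.
    CREATE TABLE fact_sales (
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        date_key INTEGER REFERENCES dim_date(date_key),
        sales_amount REAL
    );
    INSERT INTO dim_customer VALUES (1, 'Ann', 'NY'), (2, 'Bob', 'CA');
    INSERT INTO dim_date VALUES (1, '2007-02-18', '2007-02'), (2, '2007-02-19', '2007-02');
    INSERT INTO fact_sales VALUES (1, 1, 100.0), (1, 2, 50.0), (2, 2, 30.0);
    """
)

# Dimensions join to the fact table, never to each other.
sales_by_state = conn.execute(
    """
    SELECT c.state, d.month, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_customer c ON c.customer_key = f.customer_key
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY c.state, d.month
    ORDER BY c.state
    """
).fetchall()
print(sales_by_state)
```

Every analytical query has the same shape: pick the measure from the fact table, join whichever dimensions you want to analyze by, and group. That regularity is what makes query building easy.
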
6. What is a dimension?
Dimensional data models are the most common models for data warehouses. The model is very simple compared with the traditional ER model because the access paths are unpredictable. Dimensional models have two kinds of entities: one or a few fact tables (which hold the business measures) and many dimension tables (which hold the variables that describe the business and by which we can analyze the measures). The dimension tables are completely orthogonal, i.e. dimension tables join to the fact table(s) but not to each other. Dimensions (dimension tables) can contain hierarchies of variables. Each value of a dimension is represented by a row in the dimension table. The row holds a surrogate key, the natural key / application data source key, the attributes representing the value, and metadata.
7. What is a dimension hierarchy?
Dimensional models have two kinds of entities: one or a few fact tables (which hold the business measures) and many dimension tables (which hold the variables that describe the business and by which we can analyze the measures). A single variable can be represented at different levels, e.g. a date dimension can be represented by day, week, month, quarter, year etc., and a coarser level can fully contain the finer ones, i.e. a month encompasses its day values. These are dimension hierarchies. Some dimensions have multiple hierarchies based on them. (A ragged dimension is one where certain levels are missing for certain values of the dimension.) Dimension hierarchies facilitate rollup, drill down, and slicing and dicing.
In a customer dimension, data at the Customers level is aggregated into the Cities level, which, in turn, is aggregated into the Countries/Areas, Continents/Regions, and Global levels.
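
A small sketch of rolling a measure up a customer hierarchy follows. The dimension rows, the level names and the roll_up helper are hypothetical; only three levels are shown.

```python
from collections import defaultdict

# Hypothetical customer dimension rows: each row carries every level of the hierarchy.
customer_dim = {
    1: {"customer": "Ann", "city": "New York", "country": "USA"},
    2: {"customer": "Bob", "city": "Boston", "country": "USA"},
    3: {"customer": "Chloe", "city": "Paris", "country": "France"},
}

# Fact rows at the lowest grain: (customer_key, sales_amount).
facts = [(1, 100.0), (1, 50.0), (2, 30.0), (3, 70.0)]


def roll_up(level):
    """Aggregate the sales measure to the requested hierarchy level."""
    totals = defaultdict(float)
    for customer_key, amount in facts:
        totals[customer_dim[customer_key][level]] += amount
    return dict(totals)


by_city = roll_up("city")        # finer level
by_country = roll_up("country")  # coarser level, rolled up from cities
print(by_city)
print(by_country)
```

Moving from by_country to by_city is a drill down; the reverse is a rollup.
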
8. What is a fact?
Dimensional data models are the most common models for data warehouses. The model is very simple compared with the traditional ER model because the access paths are unpredictable. Dimensional models have two kinds of entities: one or a few fact tables (which hold the business measures) and many dimension tables (which hold the variables that describe the business and by which we can analyze the measures). Facts are used to analyze the business for KPIs, opportunity identification, informed decision making, management decisions, and prediction (e.g. trending).

9. What are different types of dimensions? Explain? (or What is SCD / Slowly Changing Dimensions, Type 1 dimensions, Type 2 dimensions)
One way to categorize dimensions is by how changes to the values of the dimension variables are handled in the data warehouse. If the changed values overwrite the current values, it is a Type 1 dimension. If the history of the value is maintained by using a current key / flag and / or a date range during which the row is valid, it is a Type 2 dimension. A very infrequently used Type 3 dimension adds fields to the same record to store newer values. (It is infrequent since this is hard to do in relational databases.)
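
The difference between Type 1 and Type 2 handling can be sketched as follows. The row layout (surrogate_key, natural_key, valid_from, valid_to, is_current) and the helper names are assumptions for illustration; Type 3 is left out.

```python
from datetime import date


def apply_type1(rows, natural_key, new_state):
    """Type 1: overwrite the current value; no history is kept."""
    for row in rows:
        if row["natural_key"] == natural_key:
            row["state"] = new_state


def apply_type2(rows, natural_key, new_state, change_date):
    """Type 2: close the current row and add a new row with a new surrogate key."""
    for row in rows:
        if row["natural_key"] == natural_key and row["is_current"]:
            row["is_current"] = False
            row["valid_to"] = change_date
    rows.append(
        {
            "surrogate_key": max((r["surrogate_key"] for r in rows), default=0) + 1,
            "natural_key": natural_key,
            "state": new_state,
            "valid_from": change_date,
            "valid_to": None,
            "is_current": True,
        }
    )


def new_dimension():
    """A hypothetical customer dimension with a single current row."""
    return [
        {"surrogate_key": 1, "natural_key": "C100", "state": "NY",
         "valid_from": date(2006, 1, 1), "valid_to": None, "is_current": True}
    ]


type1_rows = new_dimension()
apply_type1(type1_rows, "C100", "CA")

type2_rows = new_dimension()
apply_type2(type2_rows, "C100", "CA", date(2007, 2, 19))
print(len(type1_rows), len(type2_rows))
```

After the same change, the Type 1 dimension still has one row (now CA), while the Type 2 dimension has two rows: the closed NY row and the current CA row.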

10. What is a Surrogate key?
According to the Webster’s Unabridged Dictionary, a surrogate is an "artificial or synthetic product that is used as a substitute for a natural product." A surrogate key in a data warehouse is an artificial or synthetic key that is used as a substitute for a natural key.
Why use a surrogate key?
a. Multiple systems can have the same natural key value representing different values of a dimension variable.
b. Source system values may change.
c. The source system may not have a well-defined / accurate natural / primary key.
d. For performance reasons, as the natural key / source system primary key may not be an integer.
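
Reasons (a) and (d) can be illustrated with a minimal sketch that hands out one integer key per (source system, natural key) pair. The class SurrogateKeyGenerator and the system names CRM and BILLING are hypothetical; a real warehouse would typically persist this mapping in a table or use a database sequence.

```python
from itertools import count


class SurrogateKeyGenerator:
    """Assigns one integer key per (source system, natural key) pair."""

    def __init__(self):
        self._next_key = count(1)
        self._keys = {}

    def key_for(self, source_system, natural_key):
        pair = (source_system, natural_key)
        if pair not in self._keys:
            self._keys[pair] = next(self._next_key)
        return self._keys[pair]


keys = SurrogateKeyGenerator()
crm_key = keys.key_for("CRM", "1001")
billing_key = keys.key_for("BILLING", "1001")  # same natural key, different system
crm_again = keys.key_for("CRM", "1001")        # repeat lookups return the same key
print(crm_key, billing_key, crm_again)
```
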
11. What is metadata?
Metadata is “data about data”. Warehouse metadata describes the warehouse data and the processes used in creating the warehouse. Metadata is the key to understanding the warehouse. It helps you to locate, manage, and use information.

Business Metadata: Business metadata consists of business concepts (such as organizational structure or the business resources involved) that decide how the data is captured or how the information is presented. It also includes information about the business processes supported by the data and their contacts. Business metadata should be associated with technical metadata.

Technical Metadata: Technical metadata includes architectural, or developmental information about the location and formats and meanings of the data as it is processed and stored in the various modules of the data warehouse and reports.

Operational Metadata: Run Statistics, Data Quality statistics, Job History – the information about the actual operations of the data warehouse is the operational metadata.

12. What is a natural key / application data source key?

13. What is a star schema?

14. What is a snowflake schema?
15. What is a degenerate dimension?
16. What is a factless fact?
17. What are the different ways of loading fact tables? Explain (snapshot, drift).
18. What is a stage? Why do you need a stage?
19. Tell me about cubes.
20. Full process or incremental?
21. Are you good with data cleansing?
22. How do you handle changing dimensions?
23. Talk about the Kimball vs. Inmon approaches.
24. Talk about the concepts of ODS and information factory.
25. Talk about challenges of real-time load processing vs. batch.
26. Know the difference between Logical and Physical models.
27. Know how to use the Reverse Engineer and Comparison features.
28. The dimension model feature is pretty weak, but you might want to know how Erwin treats dimensional modeling.
29. Source target mapping
30. What is source qualifier?
31. Difference between DSS & OLTP?
32. Explain grouped cross tab?
33. Hierarchy of DWH?
34. How many repositories can we create in Informatica?
35. What is surrogate key?
36. What is difference between Mapplet and reusable transformation?
37. What is aggregate awareness?
38. Explain reference cursor?
39. What are parallel queries and query hints?
40. DWH architecture?
41. What are cursors?
42. Advantages of denormalized data?
43. What is an operational data store (ODS)?
44. What is metadata and a system catalog?
45. What is a factless fact schema?
46. What is a conformed dimension?
47. What is the capacity of power cube?
48. Difference between PowerPlay transformer and power play reports?
49. What is IQD file?
50. What is Cognos script editor?
51. What is the difference between macros and prompts?
52. What is power play plug in?
53. Which kind of index is preferred in DWH?
54. What is hash partition?
55. What is DTM session?
56. How can you define a transformation? What are different types of transformations in Informatica?
57. What is mapplet?
58. What is query panel?
59. What is a look up function? What is default transformation for the look up function?
60. What is difference between a connected look up and unconnected look up?
61. What is staging area?
62. What is data merging, data cleansing and sampling?
63. What is update strategy and what are the options for update strategy?
64. OLAP architecture?
65. What is subject area?
66. Why do we use DSS database for OLAP tools?

67. Business Objects Popular Q&A:
68. What is a universe?
69. Analysis in business objects?
70. Who launches the supervisor product in BO for first time?
71. How can you check the universe?
72. What are universe parameters?
73. Types of universes in business objects?
74. What is security domain in BO?
75. Where will you find the address of repository in BO?
76. What is broadcast agent?
77. In BO 4.1 version what is the alternative name for broadcast agent?
78. What services the broadcast agent offers on the server side?
79. How can you access your repository with different user profiles?
80. How many built-in objects are created in BO repository?
81. What are alerters in BO?
82. What are different types of saving options in web intelligence?
83. What is batch processing in BO?
84. How can you first report in BO by using broadcast agent?
85. Can we take report on Excel in BO?
86. What is a KPI?