I am often asked usually by programmers - What is Data Warehousing & how do I learn it? I explain to them we use all the same tools that you do but differently. That’s when I coined the term Data Sense. It describes the essence of Data Warehousing and separates Data Warehousing from rest of Programming. Every aspect of IT from Hardware / Software infrastructure to Design, Development and QA is done with massive data flows and need for data precession accuracy and meaning.

Sunday, February 18, 2007

Data Warehouse FAQ (Draft)

1. What is a data warehouse?

A data warehouse will be defined to be the data, software, hardware, policies, processes, tasks and documents that comprise
· The Data warehouse database
o Different disks
o Different data structures
o Different hardware architecture
o Ease of use
o Cleaned data
o Integrated data from many systems
o Longer non volatile storage
o Security
· Business Analysis
· ETL
· Derived data and Aggregations
· The data Cleaning Mechanisms
· Data Quality Assurance Processes
· Data Access mechanisms including data mining tools
· Operations, management and maintenance
· Data Extracts & other interfaces for other systems

2. Why do you build a data warehouse?
The need for data warehousing is mainly caused by the limitations of transaction processing systems.
· Different disks: Separate out querying and reporting from different systems / disks from transaction processing systems.
· Different data structures: Structure your data in database or disk in such a way that querying and reporting is fast but such structure is not appropriate for transaction processing.
· Different hardware architecture: Use different hardware the architecture not needed for transaction processing.
· Ease of use: To provide an environment where a relatively small amount of knowledge of the technical aspects of database technology is required to write and maintain queries and reports and/or to provide a means to speed up the writing and maintaining of queries and reports by technical personnel.
· Cleaned data: To provide a repository of "cleaned up" transaction processing systems data that can be reported against and that does not necessarily require fixing the transaction processing systems
· Integrated data from many systems: To make it easier, on a regular basis, to query and report data from multiple transaction processing systems and/or from external data sources and/or from data that must be stored for query/report purposes only
· Longer storage: To provide a repository of transaction processing system data that contains data from a longer span of time than can efficiently be held in a transaction processing system and/or to be able to generate reports "as was" as of a previous point in time
· Security: To prevent persons who only need to query and report transaction processing system data from having any access whatsoever to transaction processing system databases and logic used to maintain those databases
3. What are the differences between OLAP and OLTP systems / processes?

OLTP: On-Line Transaction Processing
–Many short transactions (queries + updates)
–Examples:
•Update account balance
•Enroll in course
•Add book to shopping cart
–Queries touch small amounts of data (one record or a few records) and are predictable.
–Updates are frequent
–Concurrency is biggest performance concern
OLAP: On-Line Analytical Processing
–Long transactions, complex queries
–Examples:
•Report total sales for each department in each month
•Identify top-selling books
•Count classes with fewer than 10 students
–Queries touch large amounts of data and are un predictable.
–Updates are infrequent – almost non existent
–Individual queries can require lots of resources
4. OLAP, Data warehouse, Data Mart, Dimensional Database, Star Schema, Cubes are sometimes used interchangeably. Sort out the differences.
OLAP: Online analytical processing. OLAP is a method of data possessing supporting analytical requirements as opposed to transaction requirements supported by OLTP. It encompasses all of the above.
Data warehouse: A data warehouse is a specific way of supporting OLAP requirements.
Data Mart: A data mart is a specific way of supporting OLAP requirements. It is similar to Data Warehouse but focused to a particular subject / set of questions.
Dimensional Database:
Star Schema: Describes the Structure of the Data Warehouse / Data Mart data in relational database. It is a relational implementation of a dimensional model.

5. What is dimensional Model.
A form of data modeling that facilitates OLAP style analytical query building. The model is very simple compared with the traditional ER model because the access paths are un predictable. Dimensional models have two kinds of entities – one or few fact tables (that holds the business measures) and many dimension tables (holds variables that describe the business and by which we can analyze the measures). The dimension tables are completely orthogonal – ie dimension tables join to fact table(s) but not to each other. In this form of modeling, the emphasis is not on normalizing the data but more on ease of query building.
6. What is a dimension?
Dimensional data models are most common for data warehouses. The model is very simple compared with the traditional ER model because the access paths are un predictable. Dimensional models have two kinds of entities – one or few fact tables (that holds the business measures) and many dimension tables (holds variables that describe the business and by which we can analyze the measures). The dimension tables are completely orthogonal – ie dimension tables join to fact table(s) but not to each other. Dimension (dimension tables) can contain hierarchies of variables. Each value a dimension tables is represented by a row in the dimension table. The row holds Surrogate key, Natural key / application data source key, attributes representing the value and meta data.
7. What is a dimension hierarchy?
Dimensional models have two kinds of entities – one or few fact tables (that holds the business measures) and many dimension tables (holds variables that describe the business and by which we can analyze the measures). A single variable can be represented at different levels –e.g. date dimension can be represented by day, week, month, quarter year etc and the levels can be fully overlapping a more finer representation i.e. month encompasses day values. These are dimension hierarchies. Some dimensions have multiple hierarchies based on them. (A ragged dimension is when certain levels are missing for certain values of dimension). Dimension hierarchies facilitate rollup, drill down, slicking and dicing.
In a customer dimension data at the Customers level is aggregated into the Cities level, which, in turn, is aggregated into the Countries/Areas, Continents/Regions, and Global levels.
8. What is fact?
Dimensional data models are most common for data warehouses. The model is very simple compared with the traditional ER model because the access paths are un predictable. Dimensional models have two kinds of entities – one or few fact tables (that holds the business measures) and many dimension tables (holds variables that describe the business and by which we can analyze the measures). Facts are used to analyze the business for KPI, opportunities identification, informed decision making, management decisions, prediction (e.g. trending).

9. What are different types of dimensions? Explain? (or What is SCD / Slowly Changing Dimensions, Type 1 dimensions, Type 2 dimensions)
One way to categorize dimensions is by the way the changes to the values of the dimension variables are handled in the data warehouse. If the changed values overwrite the current values then its Type 1 Dimension. If the history of the value is maintained by using a current key / flag and / or data range when the dimension is valid it is called Type 2 dimension. A very infrequently used type 3 dimension adds fields to the same record to store newer values. (Infrequent since this is hard to do in relational databases)

10. What is a Surrogate key?
According to the Webster’s Unabridged Dictionary, a surrogate is an "artificial or synthetic product that is used as a substitute for a natural product." A surrogate key in a data warehouse is an artificial or synthetic key that is used as a substitute for a natural key.
Why
a. Multiple systems can have the same value of natural key representing different values of a dimension variable
b. Source system values may change
c. Source system may not have a well-defined / accurate natural / primary key.
d. For performance reasons as Natural key / source system primary key may not be an integer.
11. What is metadata?
Metadata is “data about data”. Warehouse metadata is descriptive about warehouse data and the process used in creating the warehouse. Metadata is the key to understanding the warehouse. It helps you to locate, manage, and use information.

Business Metadata: Business Metadata consists of Business concepts (such as organizational structure or business Resources Involved:) that will decide how the data is captured, or how the information is presented. It also includes information about the business processes supported by the data, their contacts. Business Metadata should be associated with technical metadata.

Technical Metadata: Technical metadata includes architectural, or developmental information about the location and formats and meanings of the data as it is processed and stored in the various modules of the data warehouse and reports.

Operational Metadata: Run Statistics, Data Quality statistics, Job History – the information about the actual operations of the data warehouse is the operational metadata.


12. What is Natural key / application data source key?

13. What is star schema?

14. What is snowflake schema?
15. What is degenerate dimension
16. What is factless fact
17. What are different ways of loading fact tables. Explain (snapshot, drift)
18. What is a stage? Why do you need stage
19. .Tell me about cubes
20. Full process or incremental
21. Are you good with data cleansing?
22. How do you handle changing dimensions?
23. Talk about the Kimball vs. Inmon approaches.
24. Talk about the concepts of ODS and information factory.
25. Talk about challenges of real-time load processing vs. batch.
26. Know the difference between Logical and Physical models.
27. Know how to use the Reverse Engineer and Comparison features.
28. The dimension model feature is pretty weak, but you might want to know how Erwin treats dimensional modeling.
29. Source target mapping
30. What is source qualifier?
31. Difference between DSS & OLTP?
32. Explain grouped cross tab?
33. Hierarchy of DWH?
34. How many repositories can we create in Informatica?
35. What is surrogate key?
36. What is difference between Mapplet and reusable transformation?
37. What is aggregate awareness?
38. Explain reference cursor?
39. What are parallel querys and query hints?
40. DWH architecture?
41. What are cursors?
42. Advantages of de normalized data?
43. What is operational data source (ODS)?
44. What is meta data and system catalog?
45. What is factless fact schema?
46. What is confirmed dimension?
47. What is the capacity of power cube?
48. Difference between PowerPlay transformer and power play reports?
49. What is IQD file?
50. What is Cognos script editor?
51. What is difference macros and prompts?
52. What is power play plug in?
53. Which kind of index is preferred in DWH?
54. What is hash partition?
55. What is DTM session?
56. How can you define a transformation? What are different types of transformations in Informatica?
57. What is mapplet?
58. What is query panel?
59. What is a look up function? What is default transformation for the look up function?
60. What is difference between a connected look up and unconnected look up?
61. What is staging area?
62. What is data merging, data cleansing and sampling?
63. What is up date strategy and what are th options for update strategy?
64. OLAP architecture?
65. What is subject area?
66. Why do we use DSS database for OLAP tools?

67. Business Objects Popular Q&A:
68. What is a universe?
69. Analysis in business objects?
70. Who launches the supervisor product in BO for first time?
71. How can you check the universe?
72. What are universe parameters?
73. Types of universes in business objects?
74. What is security domain in BO?
75. Where will you find the address of repository in BO?
76. What is broad cast agent?
77. In BO 4.1 version what is the alternative name for broadcast agent?
78. What services the broadcast agent offers on the server side?
79. How can you access your repository with different user profiles?
80. How many built-in objects are created in BO repository?
81. What are alertors in BO?
82. What are different types of saving options in web intelligence?
83. What is batch processing in BO?
84. How can you first report in BO by using broadcast agent?
85. Can we take report on Excel in BO?
86. What is KPI