Data Warehouse (DWH) Concepts

Data Warehouse (DWH)

Data Marts

Dimensions (d/n)

Facts

Incremental Loading

Business Intelligence

Data Warehouse (DWH)

* is a s/m used for reporting & data analysis [1]
* considered a core component of BI [1]
* central repositories of integrated data from one or more disparate sources. [1]


Data Integration (DI)
* DI is the process of combining heterogeneous data sources into a single queriable schema so as to an unified view of these data. [1]

why DI is required
* often large companies and enterprises maintain separate departmental d/bs to store the data pertaining to specific department. Although such seperations of the data provides them better manageability and security, performing any cross departmental analysis on these datasets becomes impossible.

how integration can be done
* 2 major approaches for DI is
        * Tight Coupling : Data Warehousing
        * Loose Coupling : Virtual Mediated Schema

https://d2fvxlvmm9ujfq.cloudfront.net/di/Data+Integration.pdf

why do v need staging area during ETL load [1]
"we have a simple DWH that takes data from few RDBMS sources systems and load the data in dimensions and facts tables of DWH. I wonder why we have a staging layer in between. Why can't we process everything on the fly and push then in DWH".
    staging area is not a necessity if we can handle it on the fly. Few reasons why v can't avoid a staging area:
        * source systems r only available for extraction during a specific time slot which is generally lesser than your overall data loading time.
            * Its good idea to extract and keep things at your end before you lose the connection to the source s/ys
        * u want to extract data based on some conditions which require u to join two or more different s/ys together.
        * various source s/ys hav different allotted timings for data exaction.
        * DWH data loading frequency does not match with the refresh frequencies of the source s/ys.
        * ETL process involves complex data t/ns that require extra space to temporarily stage the data
        * There is specific data reconciliation / debugging requirement which warrants the use of staging area for pre, during or post load data validations

What is a Staging Area and What is the purpose of Staging Area [1]
Benefits of using Staging Database while designing Data Warehouse [1]
What is staging Area or stg table load in Informatica? [1]
Data Staging Area [1]



Top 10 things you must know before designing a data warehouse [1]
1. ETL solution takes more time to design than analytical solutions
2. DWH scope will increase along the development
3. Issues will be discovered in the source system that went undetected till date
4. u will need to validate data not being validated in source s/ms
5. user training will not be sufficient and user will not put their training in use
6. users will create conflicting business rules
7. volumetric mis-judgement is more common than you thought
8. It is ultimately IT's responsibility to prove the correctness of data
9. DWH project incur high maintenance cost
10. amount of time needed to refresh ur DWH is going to be your top concern

Incremental Loading
Methods of Incremental Loading in Data Warehouse
[1]
* incremental loading a.k.a delta loading is an widely used method to load data in DWH from respective source s/ms
* in almost all the data ware housing scenarios, we extract data from one or more source s/ms and keep storing them in DWH for future analysis.
* the source s/ms are generally OLTP s/ms which store everyday transactional data
* when it comes to loading these transactional data to dwh, v hav 2 approaches:
    * Full Load
    * Incremental Load

Incremental Loading for Dimension Table [1]
Should we do incremental loading for dimensions?
In a dimensional model, we may perform incremental loading for dimension tables also. One may argue that this wont be necessary as data volume in dimension tables are not as high as the data volumes in the fact tables, hence we can simply do a full load every time.

I personally do not agree to this argument. This is because during the last few years I have seen tremendous growth in the data in dimension tables and things can get quite heavy especially if we are trying to load SCD type 2 dimensions. Anyway, without much ado, let's delve deep.

Standard Method of Loading
Like before, for our purpose we will assume we have the below customer table in our source system from where we need to perform the data loading.

CustomerID  CustomerName  Type         LastUpdatedDate
1           John          Individual   22-Mar-2012
2           Ryan          Individual   22-Mar-2012
3           Bakers'       Corporate    23-Mar-2012


Incremental Loading for Fact Tables [1]

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


Tight Coupling [1]
* in case of tight coupling approach - which is often implemented through data warehousing, data is pulled over from disparate sources into a single physical location through ETL process.
* this approach is called tight coupling as data is tightly coupled with the physical repository at the time of query.


*


Business Intelligence
Question :
1. why can not we use transactional systems for business intelligence.








Books:
[sanb1] [http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/books/data-warehouse-dw-toolkit/The Data Warehouse Toolkit, 3rd Edition
[sanb2] [https://www.mhprofessional.com/9780071744324-usa-star-schema-the-complete-reference-group] Star Schema The Complete Reference

References:

http://www.rpi.edu/datawarehouse/dw-about.html
[http://holowczak.com/introduction-data-warehousing/]
[http://www.1keydata.com/datawarehousing/datawarehouse.html]
[http://dwhlaureate.blogspot.in/2012/11/phases-involved-in-etl-project.html]
https://dwbi.org/