OLAP And OLTP - The Transactional and Analytical...

OLTP - Online Transactional Processing
OLAP - Online Analytical Processing

We can divide the data storage into two sections, i.e., transactional (OLTP) and analytical (OLAP).

As the name suggests, OLTP systems store transactional data and OLAP systems store analytical data. We can assume that OLTP systems provide the source data for the data warehouse, whereas OLAP systems help to analyze it.

Let us take an example to explain this point:

While discussing the concept of the staging area, we took the example of a telecom company's datawarehouse.

In that case, suppose you are a mobile phone user; every individual transaction you perform ultimately becomes source data for loading the Datawarehouse.
Each individual action (like making a call, sending an SMS, or sending an MMS) becomes a transaction, which is stored in an OLTP system.

And again, as discussed in the Staging Area post, after a month the data in the OLTP systems is summarized and loaded into the OLAP systems.

Interestingly, the staging area and OLTP systems are essentially alike in this respect: both hold the transactional data for a specific interval of time, and then, after summarization, the data is loaded into OLAP (that is, the Datawarehouse), where it sits for analysis purposes.




Fig. 1(a)




-OLTP (On-line Transaction Processing): It is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is very fast query processing, maintaining data integrity in multi-access environments, and effectiveness measured by the number of transactions per second. An OLTP database contains detailed, current data, and the schema used to store transactional data is the entity model (usually 3NF).


-OLAP (On-line Analytical Processing): It is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is the effectiveness measure. OLAP applications are widely used with data mining techniques. An OLAP database contains aggregated, historical data, stored in multi-dimensional schemas (usually a star schema).
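To make the contrast concrete, here is a minimal SQL sketch continuing the telecom example; the calls and call_summary tables and their columns are hypothetical, not taken from any figure in this post:

    -- OLTP: a short, single-row write, executed many times per second
    INSERT INTO calls (call_id, subscriber_id, call_start, duration_sec, charge)
    VALUES (98711, 5012, '2015-01-01 10:32:00', 240, 1.20);

    -- OLAP: a complex, aggregating read over a month of history
    SELECT subscriber_id,
           COUNT(*)    AS total_calls,
           SUM(charge) AS monthly_charge
    FROM   call_summary
    WHERE  call_month = '2015-01'
    GROUP  BY subscriber_id;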


Based on the above discussion, the differentiation below will help you understand the difference between OLTP and OLAP.

OLTP: many short read/write transactions; detailed, current data; entity model (usually 3NF); effectiveness measured in transactions per second.
OLAP: low volume of transactions; complex, aggregating queries; aggregated, historical data; multi-dimensional schema (usually a star schema); effectiveness measured by query response time.

Hope this explains what needs to be understood about OLAP and OLTP systems.

Now let's talk about the different types of OLAP systems available.

Before moving on, just keep one thing in mind: this classification relates only to the technology used to store the data. There are various ways/technologies available for data storage.


Depending on the underlying technology used, OLAP systems can be broadly divided into:
1. MOLAP (Multidimensional OLAP),
2. ROLAP (Relational OLAP),
3. HOLAP (Hybrid OLAP).

Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and ROLAP.

MOLAP: As the name suggests, the data is stored in a multidimensional cube. The multidimensional storage is not in a relational database, but in proprietary formats.


Fig. 2(a)

In Fig. 2(a), see how the data is stored in the form of a cube, as against the data in a relational system. As you can see, the multidimensional cube removes duplication by storing the data in the form of multiple dimensions, thereby saving space.


Fig. 2(b)

Fig. 2(b) represents the same concept, but at a three-dimensional level. Here prodID becomes one dimension, storeID the second, and date the third. The value of amt is stored against these three dimensions. This clearly indicates that space utilization is better.

We can now logically cut the cube across dimensions so as to get the values for a particular dimension. For example, we can slice the cube in Fig. 2(b) vertically to get the data related to prodID, and the cube can be sliced horizontally to get the data related to storeID. This concept is called SLICING and DICING.
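MOLAP engines usually expose slicing and dicing through a cube query language such as MDX, but the idea can be sketched in plain SQL against a hypothetical sales table holding the prodID/storeID/date/amt data of Fig. 2(b) (date renamed sale_date here to avoid the reserved word):

    -- Slice: fix a single value on one dimension (one product)
    SELECT storeid, sale_date, SUM(amt) AS amt
    FROM   sales
    WHERE  prodid = 1
    GROUP  BY storeid, sale_date;

    -- Dice: pick out a sub-cube by restricting several dimensions at once
    SELECT prodid, storeid, sale_date, SUM(amt) AS amt
    FROM   sales
    WHERE  prodid  IN (1, 2)
      AND  storeid IN (10, 20)
    GROUP  BY prodid, storeid, sale_date;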


Fig. 3(a)

Advantages:

-- Excellent performance: MOLAP cubes are built for fast data retrieval and are optimal for slicing and dicing operations.

-- MOLAP can perform complex calculations: All calculations are pre-generated when the cube is created, so retrieval is very fast and complex calculations return quickly.

Disadvantages:

-- Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data. Indeed, this is possible. But in this case, only summary-level information will be included in the cube itself.

-- Requires additional investment: Cube technology is often proprietary and may not already exist in the organization. Therefore, to adopt MOLAP technology, chances are that additional investments in human and capital resources will be needed.

ROLAP: As the name suggests, the data is stored in relational tables.
This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.
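As a rough sketch of that equivalence, assuming a hypothetical sales_fact table that the ROLAP tool reports against: each slice the user takes simply appends another condition to the generated SQL.

    -- Report as first opened: total sales by region
    SELECT region, SUM(sales_amount) AS total_sales
    FROM   sales_fact
    GROUP  BY region;

    -- The user slices on year 2015: the tool adds a WHERE clause
    SELECT region, SUM(sales_amount) AS total_sales
    FROM   sales_fact
    WHERE  sales_year = 2015
    GROUP  BY region;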

Advantages:
-- Can handle large amounts of data: The data size limitation of ROLAP technology is the data size limitation of the underlying relational database. In other words, ROLAP itself places no limitation on the amount of data.

-- Can leverage functionality inherent in the relational database: Relational databases already come with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can leverage these functionalities.

Disadvantages:
-- Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.

-- Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are therefore traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions as well as the ability to allow users to define their own functions.


HOLAP: HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill through" from the cube into the underlying relational data.

As you can understand now, OLAP systems can be created using various technologies.

Slowly Changing Dimension - The SCD

The SCDs - Slowly Changing Dimensions

As the name suggests, this concept deals with how we wish to manage the changes happening in a Dimension table. This concept applies specifically to Dimension tables.
Fact tables usually change continuously, i.e., the data in a Fact table changes all the time, whereas the data in a Dimension table does not change continuously; hence it is called a Slowly Changing Dimension.

Let us take an example of Employee Table, to understand this.

Consider an employee with Employee_ID 101 and Name 'Steve', whose current working location is India.



Now consider that the employee is relocated to Japan. Because the location of an employee does not change very frequently, it is a slowly changing attribute of the Dimension.
Since the location of the employee has changed, the update needs to be reflected in the Dimension table. This is the "Slowly Changing Dimension" problem.


There are in general three ways to solve this type of problem, and they are categorized as follows:
Type 1: The new record replaces the original record. We no longer have the old record available.

Type 2: A new record is added along with the old record. So we have Current as well as Old record.


Type 3: The original record is modified by adding a new column, so the previous value is kept alongside the current one.

So, taking the above three scenarios into consideration, we have:

1. SCD 1 (Slowly Changing Dimension 1)

In Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other words, no history is kept.
So in the above example, as Steve has moved to Japan, the entry in the Dimension table now looks like this:


So we do not have the original record. 

In SCD1, only the current data is stored; no historical data is kept.
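At the ETL layer, a Type 1 change is just an in-place overwrite. A minimal sketch, assuming a hypothetical employee_dim table with the columns used in this example:

    -- SCD Type 1: overwrite the attribute; the old value ('India') is lost
    UPDATE employee_dim
    SET    location = 'Japan'
    WHERE  employee_id = 101;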

Advantages:
1. This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.
2. Less space is required, as we are not storing historical data.

Disadvantages:
1. No history available, so we cannot trace back to see any historical record.

When to use Type 1:
SCD 1 should be used when it is not necessary for the data warehouse to keep track of historical changes.

2. SCD 2 (Slowly Changing Dimension 2)

In SCD2, the new information is added as a new record. In other words, the new and old records are present as two separate records.

So in the above example, as Steve has moved to Japan, the entries in the Dimension table now look like this:


So we have the original record along with the current record. 

In SCD2, both the current data and the historical data are stored.

The problem in the above case is with Employee_ID: being the primary key, it cannot be duplicated. Also, Steve being the same person, he cannot have two Employee_IDs.

This problem can be managed by using the surrogate key concept: we add another column and give it a unique value for each record. So the data looks like:

This helps in maintaining the proper sequence of the data, and also helps in maintaining versions.
Then there is the other issue: how to identify the current and the historical data. For that we can create another column which indicates which record is current and which one is historical.

All of this is done at the ETL layer.
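A minimal sketch of how the ETL layer could apply a Type 2 change, assuming a hypothetical employee_dim table that already has the surrogate key and current-record flag columns described above:

    -- SCD Type 2, step 1: mark the existing record as historical
    UPDATE employee_dim
    SET    current_flag = 'N'
    WHERE  employee_id = 101
      AND  current_flag = 'Y';

    -- Step 2: insert the new version with a fresh surrogate key
    INSERT INTO employee_dim
        (surrogate_key, employee_id, employee_name, location, current_flag)
    VALUES
        (2, 101, 'Steve', 'Japan', 'Y');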

Advantages:
1. We can use both the current and the historical records for gathering information.

Disadvantages:
1.  This will cause the size of the table to grow fast. 
2.  This implementation complicates the ETL process.

When to use Type 2:
SCD2 should be used when it is necessary for the data warehouse to track historical changes.

3. SCD 3 (Slowly Changing Dimension 3)

In SCD3, the new information is added as a new column, as against a new record in SCD2. In other words, the new and old values are present in the same record, but in different columns.

So in the above example, as Steve has moved to Japan, the entry in the Dimension table now looks like this:


So when building a Datawarehouse with SCD3, we first decide how many versions of a particular attribute we wish to maintain; columns are added to the table accordingly, and they are maintained that way in future.

So we have the original value along with the current value, stored in the same record, but in different columns.

In SCD3, partial history is stored. Partial as in: what happens if Steve is relocated again, this time to China?


So we lose the data representing India; hence, partial history.
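A minimal sketch of a Type 3 change, again with a hypothetical employee_dim table: the extra column is added once, up front, and each move shifts the current value into it before overwriting.

    -- One-time schema change: a column to hold the previous value
    ALTER TABLE employee_dim ADD previous_location VARCHAR(50);

    -- SCD Type 3: preserve the current value, then overwrite it
    UPDATE employee_dim
    SET    previous_location = location,  -- 'Japan' is kept...
           location          = 'China'    -- ...but the older 'India' is now lost
    WHERE  employee_id = 101;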


Advantages:
1. This does not significantly increase the size of the table

2. This allows us to keep some part of history.

Disadvantages:
1.  This SCD will not be able to maintain all the historical data. 

2.  This implementation also complicates the ETL process.

When to use Type 3:
SCD 3 should be used when it is not necessary to maintain the complete history of a particular entity, and the company is happy with partial history.

Hope you are clear about the concept of SCD now.

The Schemas - Datawarehousing table arrangements

The Schemas:

The concept of Schema talks about how various Dimension and Fact tables can be arranged to create a Datawarehouse.

In designing data models for data warehouses / data marts, the most commonly used schema types are the Star Schema, the Snowflake Schema, and the Fact Constellation Schema. Apart from these, there is something called a Hybrid Schema.

Let's talk about each type:

Star Schema: The star schema architecture is the simplest data warehouse schema. It is called a star schema because the diagram resembles a star, with a Fact table at the center and radiating Dimension tables. Each dimension is represented as a single table. The primary key in each dimension table is related to a foreign key in the fact table.
A star schema can be simple or complex. A simple star consists of one fact table; a complex star can have more than one fact table.

The diagram below represents a Star Schema.


Fig. 1(a)


Fig. 1(b)

In the figure above, Sales is the Fact table and the others are Dimension tables. Also notice the primary key and foreign key relationships.
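A minimal sketch of the star arrangement in SQL, using hypothetical table and column names rather than the exact tables in the figure: one central fact table whose foreign keys point at the surrounding dimension tables.

    CREATE TABLE dim_time     (time_key     INT PRIMARY KEY, year INT, month INT, day INT);
    CREATE TABLE dim_item     (item_key     INT PRIMARY KEY, item_name VARCHAR(50));
    CREATE TABLE dim_location (location_key INT PRIMARY KEY, city VARCHAR(50));

    -- The fact table sits at the center of the star
    CREATE TABLE fact_sales (
        time_key     INT REFERENCES dim_time (time_key),
        item_key     INT REFERENCES dim_item (item_key),
        location_key INT REFERENCES dim_location (location_key),
        sales_amount DECIMAL(10,2)   -- the measure
    );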

Snowflake schema: The snowflake schema resembles a star schema, with the modification that the dimension tables can be normalized into multiple tables, each representing a level in the dimensional hierarchy. In a snowflake schema, some or all of the Dimension tables can be normalized. This is called Snowflaking.
A snowflake schema is used to improve the performance of certain queries.
A star schema stores all attributes for a dimension in one (denormalized) table, which indirectly requires more space compared to the normalized snowflake schema. A snowflake schema normalizes a particular dimension table by moving some of the attributes with low cardinality (few distinct values) into separate dimension tables; such a normalized table is then connected to the dimension table by a foreign key (a sketch follows the list below). A snowflake schema should not be used only with the thought of reducing disk space, because it can adversely impact query performance in certain cases.
Points to consider with a snowflake schema:
  • Attributes which will be rarely used, where most dimension member records have a NULL value for the attribute, can be moved to a sub-dimension.
  • Attributes which belong to a specific hierarchy and are queried independently can be split out, like Date (Year, Month, Day) and Location (Country, State).
  • Normalizing dimension tables directly increases the number of dimension or sub-dimension tables present in the Datawarehouse, which in turn requires more primary key-foreign key relationships, increasing the complexity of the datawarehouse. It can also reduce query performance because of the larger number of tables.
  • Queries against a snowflake schema are more complex than queries against a star schema, due to the multiple joins from dimension tables to sub-dimension tables.
  • Snowflake schema helps in saving space by normalizing dimension tables.
  • Creating aggregate table(s) and joining them to the required dimension table(s) improves performance by reducing execution time in some cases.
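As promised above, a minimal sketch of snowflaking with hypothetical names: the low-cardinality category attribute is moved out of the product dimension into a sub-dimension connected by a foreign key.

    -- Star version: one denormalized dimension
    CREATE TABLE dim_product (
        product_key   INT PRIMARY KEY,
        product_name  VARCHAR(50),
        category_name VARCHAR(50)    -- repeated for every product in the category
    );

    -- Snowflake version: the category becomes its own sub-dimension
    CREATE TABLE dim_category (
        category_key  INT PRIMARY KEY,
        category_name VARCHAR(50)
    );

    CREATE TABLE dim_product_sf (
        product_key  INT PRIMARY KEY,
        product_name VARCHAR(50),
        category_key INT REFERENCES dim_category (category_key)
    );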
The diagram below represents the Snowflake Schema:


Fig. 2(a)

Fact Constellation Schema: The Fact Constellation Schema, also called the Galaxy Schema, is mainly identified by the presence of multiple Fact tables, with some of the Dimension tables shared across the Fact tables.

Fig. 3(a)
Usually, Fact tables are separated, or multiple Fact tables are created, when we require aggregates over different facts or dimensions.
As you can see in the diagram above, Fig. 3(a), we have two Fact tables and multiple Dimension tables. In this scenario we have two Fact tables because one Fact table stores the calculated (aggregate) values at the Sales level and the other at the Shipping level. Also, as can easily be noticed, the location and item Dimension tables are shared across both Fact tables.
Another example of Fact Constellation below:
Fig. 3(b)
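A minimal sketch of a fact constellation, reusing the hypothetical dimension tables from the star schema sketch above: a second fact table, at a different level, shares the same dimensions as fact_sales.

    -- A second fact table, at the shipping level, sharing the same dimensions
    CREATE TABLE fact_shipping (
        time_key      INT REFERENCES dim_time (time_key),
        item_key      INT REFERENCES dim_item (item_key),
        location_key  INT REFERENCES dim_location (location_key),
        units_shipped INT,
        shipping_cost DECIMAL(10,2)
    );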

Hybrid Schema: A hybrid schema is built by utilizing or combining the three different schema types described above. Datawarehousing is a very complex discipline, and a company often needs to combine the advantages of the different schemas while avoiding their disadvantages, which gives birth to the Hybrid Schema.

The Datawarehouse Tables --- Dimension and Fact Tables

In the last post we discussed the Architecture of the Datawarehouse, and also discussed the Staging Area in detail.

Let us talk about the various types of tables present in a Datawarehouse.


As discussed earlier, a Datawarehouse is a combination of multiple tables which are linked to each other. These multiple tables are the Dimension tables and the Fact tables.


Consider a simple Datawarehouse structure below:



Fig: Schema
Dimension Tables: The subject-oriented tables which we discussed earlier in the definition of a Datawarehouse are usually the Dimension tables. The Dimension tables are (usually) the master tables.

Dimension tables are used to describe a particular entity (dimensions); they contain dimension keys, values and attributes. 

Dimension tables usually describe a particular aspect of a business, like EMPLOYEE, CUSTOMERS, AGENTS, LOCATIONS etc.

A Dimension table contains columns like EMPLOYEE_ID, EMPLOYEE_NAME, EMPLOYEE_LOCATION, AGE, GENDER, etc.
Dimension tables contain numeric as well as text data.

Consider an example of an EMPLOYEE table in a Datawarehouse.

The typical columns in an Employee table would be: Employee_ID (Number), Employee_Name (String), Employee_Age (Number), Employee_Address (String), etc.
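A minimal sketch of such a dimension table in SQL, with hypothetical column sizes:

    CREATE TABLE employee_dim (
        employee_id      INT PRIMARY KEY,   -- dimension key
        employee_name    VARCHAR(100),      -- text attribute
        employee_age     INT,               -- numeric attribute
        employee_address VARCHAR(200)       -- text attribute
    );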


Typical characteristics of a Dimension table:


  • Wide rows with lots of descriptive text along with numbers - Dimension tables contain both numeric and text datatypes. For example, the AGE column is a number, while the NAME and ADDRESS columns are text.
  • Mostly small tables (a few thousand records) - Dimension tables usually contain relatively few records, typically ranging from a few to several thousand rows. Consider a LOCATION table: a business usually doesn't have many locations, so the count may be in the hundreds or thousands. Similarly, for an EMPLOYEES table, a business can have employees in the range of lakhs, which is not a very big number for a data warehouse. Occasionally dimensions can grow fairly large, however. For example, a large credit card company could have a customer dimension with millions of rows.
  • Joined to the fact table by foreign keys - As discussed earlier in the post on Schemas, all Dimension tables have their primary keys present in the Fact table as foreign keys for reference purposes.
  • Heavily indexed - Dimension tables can have many indexes and constraints defined on them, for example primary key, foreign key, NOT NULL, and unique.
  • Can be normalized
Typical dimensions

Date, Region (Country, Cities), Products, Customers, Salesperson, Employees etc.  

In the diagram above: Time, Item, Location, and Branch are the Dimension tables.


Fact Table: A fact table is a table that contains the calculated/measured/factual values, called MEASURES. Fact tables are loaded using the keys from the Dimension tables. For example, avg_sales in the diagram above would be such a measure. The measure is stored in the fact table at the appropriate granularity. For example, it can be sales amount by store by day; in this case, the fact table would contain three columns: a date column, a store column, and a sales amount column.


Fact tables contain keys to dimension tables as well as measurable facts that data analysts would want to examine.

Fact tables can grow very large, with millions or even billions of rows. It is important to identify the lowest level of facts that makes sense to analyze for your business; this is often referred to as the fact table "grain".

Consider a scenario, using the diagram above:
Time_key = 111
Item_key = 11
Branch_key = 1111
Location_key = 1001

An item with Item_key = 11, sold from the branch with Branch_key = 1111, at the location with Location_key = 1001, at the time with Time_key = 111, is sold for $50. That is, the dimension-key combination

(Item_key 11, Branch_key 1111, Location_key 1001, Time_key 111) → Sales_Amount $50.

Now, the keys 11, 1111, 1001, and 111 would be stored in the respective Dimension tables, and the calculated value $50 would be stored in the Fact table. Suppose also that the product has been sold 5 times; that count will likewise be present in the Fact table.

The structure of the Fact table will be:


As explained above, we can have a large number of such calculations.
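A minimal sketch of that structure in SQL, with hypothetical names, holding the row from the scenario above:

    CREATE TABLE sales_fact (
        time_key     INT,             -- foreign key to the Time dimension
        item_key     INT,             -- foreign key to the Item dimension
        branch_key   INT,             -- foreign key to the Branch dimension
        location_key INT,             -- foreign key to the Location dimension
        units_sold   INT,             -- measure: the 5 units sold
        sales_amount DECIMAL(10,2)    -- measure: the $50
    );

    INSERT INTO sales_fact VALUES (111, 11, 1111, 1001, 5, 50.00);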

Hope you are clear about the Dimension and Fact table.

Some definitions related to Tables:

Attribute: A unique level within a dimension. These are nothing but the column names of the tables. For example, Month is an attribute in the Time Dimension.

Hierarchy: The specification of levels that represents relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.
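A hierarchy is what lets queries roll a measure up from fine grain to coarse grain. A minimal sketch, assuming a hypothetical time_dim table (with year, quarter, and month attributes) joined to the sales_fact table above:

    -- Roll sales up the Time hierarchy: Day -> Month -> Quarter -> Year
    SELECT t.year, t.quarter, t.month, SUM(f.sales_amount) AS sales
    FROM   sales_fact f
    JOIN   time_dim t ON t.time_key = f.time_key
    GROUP  BY t.year, t.quarter, t.month;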


Types of Fact Tables:

There are two types of Fact tables:


  • Cumulative: This type of fact table describes what has happened over a period of time. For example, it may describe the total sales by product by store by day. The facts in this type of fact table are mostly additive facts. The first example presented below is a cumulative fact table.
  • Snapshot: This type of fact table describes the state of things at a particular instant of time, and usually includes more semi-additive and non-additive facts. The second example presented below is a snapshot fact table.
Types of Facts

There are three types of facts:

Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table.

Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.

Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

Let us use examples to illustrate each of the three types of facts. The first example assumes that we are a retailer, and we have a fact table with the following columns (this means we have 3 Dimension tables, each containing one primary key that is present as a foreign key in the fact table, as shown below):
Sales_Amount is the measure in the Fact table.

Date_ID
Store_ID
Product_ID
Sales_Amount

Consider the values present in the Dimension tables:

Date Dimension Table:
Date_ID, Date
1, 01-01-2015
2, 02-01-2015
3, 03-01-2015
4, 04-01-2015

Store Dimension Table:
Store_ID, Store_Location
101, India
102, Germany
103, China
104, Japan

Product Dimension Table:
Product_ID, Product_Name
10, RICE
20, SUGAR
30, BREAD
40, WHEAT

The purpose of this table is to record the sales amount for each product (Product_ID) in each store (Store_ID) on a daily basis (Date_ID).

In this case, Sales_Amount is an additive fact, because Sales_Amount can be meaningfully summed up along all three of the dimensions whose keys are present in the fact table, i.e., Date_ID, Store_ID, and Product_ID.

So, say the Sales_Amount for 01-01-2015 (Date_ID = 1) in India (Store_ID = 101) for RICE (Product_ID = 10) is $100. This $100 is a Sales_Amount value present in the Fact table, recorded against all three dimension keys, and such values can be added up across any of the three dimensions; hence it is called an additive fact.
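Because the fact is additive, summing it along any (or every) dimension produces a meaningful number. A sketch, assuming the fact table above is named retail_fact (a hypothetical name, since the text does not give one):

    -- Sum across all stores and products for one day: meaningful
    SELECT SUM(sales_amount) FROM retail_fact WHERE date_id = 1;

    -- Sum across every dimension at once (grand total): also meaningful
    SELECT SUM(sales_amount) FROM retail_fact;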

Let's take the next example: say we are a bank with the following fact table:

Date
Account
Current_Balance
Profit_Margin

The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day. Current_Balance and Profit_Margin are the measures.

Current_Balance is a semi-additive fact, as it makes sense to add balances up across all accounts (what's the total current balance for all accounts in the bank?), but it does not make sense to add them up through time (adding up all current balances for a given account for each day of the month does not give us any useful information).

Profit_Margin is a non-additive fact, for it does not make sense to add it up at either the account level or the day level.

In the next post we will talk about the different types of Schemas, another very important concept.

The DW Architecture Explained - Detailed Version

Let's talk about the detailed Architecture of Datawarehouse.

In the previous post we saw the Architecture of the Datawarehouse at a high level. In this post the Architecture is explained at a detailed level.


Fig. (c)



Check out the image above; it is divided into different layers. Let's talk about each layer in detail. Some of the layers in the Architecture are very easy to understand, as the names are self-explanatory.


1. Data Source Layer: This is the layer where the raw data is present; it is treated as the source for the datawarehouse.



This indicates the data sources that feed data into the data warehouse. A data source can be of any format: flat file, relational database, other types of databases, Excel files, etc.

Consider a company which has been operating for 5 years and has now decided to create a datawarehouse. For the last 5 years they must have been storing the data in some form, maybe Excel files, maybe some tables, etc. Those Excel files and tables will become the source for loading the datawarehouse.



So the various types of sources which we can have are:
a. Excel sheets and flat files,
b. Relational databases,
c. Live databases,
d. ERP systems,
e. Legacy systems,
f. Web server logs with user browsing data,
g. Internal market research data,
h. Third-party data, such as census data, demographics data, or survey data.

2. Data Extraction Layer: This is the layer where the data from the source is extracted and loaded into the next layer, called the staging area (we will see more details in the next section).


In this layer the data extracted from the source can be modified before being loaded into the staging area, or it can be loaded directly into the staging area without any modification.


This phase can be considered part of the ETL phase, as the data extraction and loading are done using ETL tools.


3. Staging Area: This is a common place where all the data is collected before being summarized and loaded into the data warehouse.

This is a very important stage in the Datawarehouse Architecture.

We will discuss the Staging Area in detail in the next post.


4. ETL Layer: This is the layer where we extract the data from the staging area (if present), modify/transform/cleanse it, and load it into the datawarehouse. We use ETL tools at this layer.



Earlier this work was done using SQL or mainframe processes, which used to take a long time to complete and were also difficult to work with.
With the advent of ETL tools, we have various tools available in the market which make this task simpler.

Different ETL tools available in the market include (but are not limited to):
1. Informatica,
2. DataStage,
3. Ab Initio,
4. Pentaho, etc.

These tools are in a lot of demand in the market, as they help in faster, better, and easier processing of the data.

5. Data Storage Layer: This is nothing but the Datawarehouse itself, where we store the analytical data for analysis.


6. Data Logic Layer: This is the layer where we use the data stored in the Datawarehouse to analyze and generate reports, among many other things. We can write queries, use reporting tools, or use data mining tools as per business needs.


As with ETL tools, there are a lot of reporting tools available in the industry, which have now replaced the old reporting technologies and are helping organizations do better and faster reporting.

Some of the latest reporting tools available in the market are:
1. OBIEE
2. QlikView
3. MicroStrategy
4. Cognos
5. Tableau
6. Hyperion, etc.

7. Data Presentation Layer: This is the layer where we create the reports and analyses from the data stored in the datawarehouse. Various reporting/analysis/mining tools are available which help perform different operations on the data.


8. Source Operation Layer: This layer covers the various operations performed on the data once it has been extracted from the source layer.


9. Metadata Layer: Metadata is nothing but data about the data. All the layers described above have their own metadata.


This completes the explanation of the Architecture of the Data Warehouse.


In the next post we will talk about the Staging Area in detail.