Datawarehousing - The definition - Explained in simple terms

In this post we will be talking about the Definition of Data Warehouse.

In the couple of previous posts we have seen some very basic information about datawarehouse. I have tried to portray datawarehouse in simple terms.


Datawarehousing, as the name suggests is a game played with data. The data for today, yesterday, day before yesterday, a week old data, an year old data stretching may be up-to the ages of your grand grand father.

The data warehouse is mainly the term used in the companies to manage the data. The larger the company, the bigger the data, the bigger requirement to manage the data.

The concept of a data warehouse is to have a centralized database that is used to capture information from different parts of the business process. The definition of a data warehouse can be determined by the collection of data and how it is used by the company and individuals that it supports. Data warehousing is the method used by businesses where they can create and maintain information for a company wide view of the data.

Lets discuss about how actually we can define datawarehouse - as given by pioneers of Datawarehousing.


-- A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a what they can understand and use in a business context.


Other one - a little technical definition...


-- A process of transforming data into information and making it available to users in a timely enough manner to make a difference



Lets understand these points in details:

1. Single and complete: A single/complete store of data???

Here is an example:

Let's say you started a small boutique in Ohio, United States - you sell authentic dresses.

Whatever transactions where done, you are saving those in your personal computer in an excel sheet. Since its just a start-up you do not have many transactions - say 10 transactions per day, and you can use your excel sheet to store the details.
So on an average you are generating 100(transactions/day) * 30 (days/month) * 12 (months/year)
i.e, 10 * 30 * 12 = 3600.

Example of some transaction you saved.

So we have 3600 records in your Excel Sheet per year.

Now suppose after an year you stand good in your business and think your market is growing and you wish to analyse the data and see which brand sells best, or you want to analyse which quarter the sell is at its best etc. 
Since you have all your data in a single personal computer of yours, you can just easily analyse the data and find out all you need and accordingly you can make strategies.

With gods grace and your hard work, you thought of expanding your business and you started a new boutique in Ohio - so now you have 2 boutiques in Ohio. You hired a person to manage the boutique - he takes care of managing the new boutique and storing the data in a personal computer at new shop.


After few days you wanted to analyse the performance of new boutique - How will you do that???

1. You travel to new boutique - check the data in personal computer there and analyse.
2. You call the manager of new boutique with the data to your place - you check the data and do analysis.
3. You ask the data from new boutique over email - and you analyse.

All this is possible because - 

a. You are located in a quite small geographical space of world.
b. Your data is not that huge.

After 2-3 years, you are now expanding your business to other states of US - you are planning to start 10 new boutiques - now the larger once. Father more in next few years, you have shops in 5 different continents of this world.


CAN YOU NOW

1. travel to each boutique in 5 different continents to check the data in personal computers at each boutiques???
2. call the manager of each boutique with the data to your place with the data for doing analysis???
3. Ask for the data from each boutique over email - and do analyse????

ANSWER - NO


Reason???

1. You are now located in quite a larger geographical area.
2. You cannot travel to each location to check the data and do analysis
3. You do not have skills to analyse the data which is HUGE now.

SOLUTION - A DATAWAREHOUSE

You need a centralized location where all your data is stored. Datawarehouse is the solution. You have data across your enterprise stored at a common location, which you can use to analyse with ease.

2. Consistent: What do we mean by consistent in Datawarehouse?


As we discussed in above example, that you have hired managers to manage the store and in turn its the manager's responsibility now to manage the data as well.


Now suppose there is a data entry operator in Japan and other in China. The ways in which they enter the data will be different. See the example of data snapshot again:

                            


In this example, the Date of Purchase is not stored in a common format. now if you wish to do analysis by looking the data, it will be difficult to do so and there are chances that you miss some data for a particular analysis.
Here, sales done in FEBRUARY is represented in different format i.e, 2-Feb-13 and 5/2/2013.
1. There are chances that you miss some entry while you analyse using data stored in excel sheet.
2. You will lose the data if you use SQL to extract the data, because you are not aware of different formats in which data is stored.

SOLUTION - DATAWAREHOUSE.


Datawarehouse stored the data in consistent format. Means, when we store the above data in Datawarehouse, the data will be converted in a common format(example - DD-MM-YYYY) before loading into datawarehouse.


Now you can understand what advantages it will have.

1. You can just easily extract the data.
2. You will not miss any data due to in-consistent data format.

I hope now you understand the definition:

singlecomplete and consistent store of data obtained from a variety of different sources made available to end users in a what they can understand and use in a business context.


The most popular definition came from Bill Inmon (Father of Data warehousing), who provided the following:

-- A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.

Subject-Oriented: Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.
Here is an example:

We will illustrate the above example.

The boutique which you started is grand success now. You have around 100 stored across globe now and you are generating huge amount of data. Now there is a need for proper analysis of that data so that you can do the business and make future policies for your stores across different continents.
For that you decided to build a datawarehouse. 
As you are aware datawarehouse is combination of multiple tables. How many table? What type of table? What will be the structure of table?

When you run a business, you know that there are few important aspects of your business i.e. there are few important entities of your business which are must to run your business. For example - 

1. Employees: This is very important entity of your business to run.
2. Product: This is the entity which tell you for what all products you do business.
3. Location: This is the entity which tell you what all locations you are doing your business.
4. Customer: This entity is most important which tell you about all you customers.
    There are many other, we will take only 4 for example.

Now a good Datawarehouse design says - The important entities of your business should be present as a separate table in datawarehouse i.e, you create a EMPLOYEE table, a PRODUCT table, a LOCATION table, a CUSTOMER table.


This is called as SUBJECT ORIENTED - The important aspects/entity of your business should be present as a separate table.


Why this is important:

Since these are important entity of your business, you create a separate table for these entities and you store data accordingly. Now you have four table with data:
1. Employee Table: Contains data for all the employees you have across enterprise.
2. Product Table: Contains details of all the products you do business across globe.
3. Location Table: Contains details of all the location where you do business.
4. Customer Table: Contains details of the customer with whom you do business.

When you wish to extract some data related to a particular Employee, you will go to employee table and get that data, you need not search in random across other tables. This saves your time and resources. Also gives faster result.



Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.
This is similar to the explanation we saw for SINGLE source of data in above explanation.


Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, 20 years or even older data from a data warehouse. This contrasts with a transactions system, where often only the most recent data is kept. For example, a transaction system may hold the most recent address of a customer, where a data warehouse can hold all addresses associated with a customer.

Non-volatile: Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.

I hope you are not in a better position to understand about datawarehouse. Atleast the Definition now !!!

In next post we will see more concepts of datawarehouse. 

A Big warehouse - Data Warehouse



Similarity in a Warehouse and Data Warehouse:

If you are a new born baby or heard about datawarehosuing for first time - here are some definitions of Datawarehouse for you.
We will see the definition given by pioneers later in my post. 

a. Datawarehouse is a place where I can store huge amount of data in a systematic manner.

b. Datawarehouse similar Database to is nothing but combination of multiple tables linked to each other through primary key and foreign constraints.
c. Datawarehouse stores data collected over multiple years(Historical).
d. Datawarehouse is used for analyzing the data for decision making purposes of the enterprise.
e. Datawarehouse is a place where we store huge amount of data, from which we can easily extract the data.

Now read this:

Datawarehouse is combination of multiple table linked to each other which stores the historical data in systematic manner which is used for analyzing the data for decision making purposes.

Moving forward - what actually we mean by all these terms.


What do we mean by storing the data in systematic manner and why it is important.

Don't understand anything out of this !!!!

Let's talk about some familiar terms first.


Have you ever got a chance to see any warehouse(not the city retail shop) - The big storage place usually out of city where companies keep there products. And based on the requirements the products are dispatched to retail shops.


Take a look at fig(a) below, a place which is not well arranged and not maintained. Now if I try to go there and get some thing (say - a PENCIL) out of that, it will be really difficult for me.


Fig(a) - A warehouse which is not systematically arranged. 

Disadvantages of this place:

1. I am not sure if I have PENCIL.
2. If I am sure PENCIL is there, I am not sure where it will be
3. I am not sure how many PENCIL's I have?
4. Even if I am aware that I have pencil in that huge stock, I am not sure how will i reach there and take that out..

Now have a look at fig(b) below, a nice place !!!



Fig(b) - A warehouse which is systematically arranged.

Advantages of this place (you all can now easily answer..)

1. I am aware of the place where PENCIL is present.
2. I can easily go there and search for the PENCIL.
3. I will save my time, since I know the place where PENCIL is kept.
4. Since place is properly arranged, I will be aware how many PENCIL's we have stored.

Now relate above figure(a) as your data and fig(b) as datawarehouse.


Only having the data is not enough, having the data systematically arranged is important so that you can use that data for analysis and other purposes in future. - This is where Datawarehouse helps.


You all know every company which works in whatever corner of this world generates Data - some generate huge data(Amazon.com, Wallmart etc) and some generate little data(a new start up). So when we see there are so many advantages of having a Datawarehouse why don't every company have a datawarehouse.


Friends - you might have heard - Every coin has 2 sides. So when Datawarehouse brings in so many advantages, it comes to you with some disadvantages.


To list a few:

1. Heavy cost of building the datawarehouse - you need integration tools, you need reporting tool, You need the storage devices etc.
2. Heavy cost of maintaining the datawarehouse.

The cost is the primary aspect which stops many companies from having a datawarehouse.



In the next post we will see the definition of Datawarehousing.

Datawarehousing – not any more a alien world

Data Warehousing !!!!

This no more is the thing with only IT companies, the Mechanical companies, The electrical companies, The Manufacturing Industry, The Hospitals, The Banks, The Telecom Companies all are using the Data Warehousing, since all have understood the power of storing the data in sophisticated manner and using it for Reporting and Analytical purposes.

The DATAwarehousing, as the name suggests is a game played with data. The data for today, yesterday, day before yesterday, a week old data, an year old data stretching may be up to the ages of your grand grand father.

The data warehouse is mainly the term used in the companies to manage the data. The larger the company, the bigger the data, the bigger requirement to manage the data.
The concept of a data warehouse is to have a centralized database that is used to capture information from different parts of the business process. The definition of a data warehouse can be determined by the collection of data and how it is used by the company and individuals that it supports. Data warehousing is the method used by businesses where they can create and maintain information for a company wide view of the data.

The History and Evolution of Data Warehousing

Many people feel that data warehousing is only copying data from one place to another, They feel “That doesn’t make any sense! Why waste time copying and moving data, and storing it in a different database? Why not just get it directly from its original location when someone needs it?” This is totally incorrect. Data Warehouses are very much evolved as compared to Databases. 

Data Warehouses do far more then just storing the data.

Going to 1970-80's years back, the computing world was dominated by the mainframe in those days. Real data-processing applications, the ones run on the corporate mainframe, almost always had a complicated set of files or early-generation databases (not the table-oriented relational databases most applications use today) in which they stored data.

Although the applications did a fairly good job of performing routine data processing functions, data created as a result of these functions (such as
information about customers, the products they ordered, and how much
money they spent) was locked away in the depths of the files and databases.
The way data was stored made it more complex to solve the purpose.
It was almost impossible, for example, to see how retail stores in one
region were doing against stores in the other region, against their competitors, or even against their own performance in some earlier period.

Between 1976 and 1979, the concept grew out of research, driven from discussions with Citibank’s advanced
technology group, called TARADATA.The name Teradata was chosen
to symbolize the ability to manage terabytes (trillions of bytes) of data.

Early days of Data Warehousing - 1980's

As the era began, the computers were the only name making there presence everywhere. The organizantion started to have computers every where in each department. How could an organization hope to compete if its data was scattered all over the place on different computer systems that weren’t even all under the control of the centralized data processing department? (Never mind that even when the data was all stored on mainframes, it was still isolated in different files and databases, so it was just as inaccessible.)
A special software then came into existence which made the life simple for the user and everyone, using the data or analyzing the data. This new type of software, called a distributed database management system (distributed DBMS, or DDBMS), would magically pull the requested data from databases across the
organization, bring all the data back to the same place, and then consolidate
it, sort it, and do whatever else was necessary to answer the user’s question.
Although this was thing all were expecting, but the chocolate is always better if its sweeter, the sweetest the perfect. The chocolate being RDBMS (relational database management system) for decision support — the world’s first.

Why Data Warehouse over Normal Databases !!!

The need has been established for a company wide view of data in operational systems. Date warehouses are designed to help management and businesses analyze data and this helps to fill the need for subject- oriented concepts. Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When this concept is achieved, the data warehouse is considered to be integrated.
The form of the stored data has nothing to do with whether something is a data warehouse. A data warehouse can be normalized or de-normalized. It can be a relational database, multidimensional database, flat file, hierarchical database, object database, etc. Data warehouse data often gets changed. Also, data warehouses will most often be directed to a specific action or entity.
Data warehouse success cannot be guaranteed for each project. The techniques involved can become quite complicated and erroneous data can also cause errors and failure. When management support is strong, resources committed for business values, and an enterprise vision is established, the end results may turn out to be more helpful for the organization or business. The main factors that create needs for data warehousing for most businesses today are requirements for the companywide view of quality information and departments separating informational from operational systems for improved performance for managing data.

With this basic history of Data Warehouse, lets try to gather more information about the Giant - The Data Warehouse.

There are numerous pages, blogs, websites these days which talk about Datawarehousing. No doubt few of them have really good information.

I have gone through many of them – Its great to see the enormous knowledge people gather in there life time and the more better work they do is share that knowledge with others.

When I shuffle through the websites, blogs I sometimes feel annoyed that none location provide everything at its best. Definition of Datawarehousing is written in great manner in one blog, OLAP and OLTP is good at other ........!!!
Other bad things – Who the hell is going to tell me how do I relate all these to world around me so that I can understand its importance.

Now try reading few other pages on internet – and you will understand what I am trying to say in my last sentence above.


After working of quite a few Data Warehousing projects and gathering lots of information from the pioneers article - here are few of my real life examples for all the topics which we say “DATAWAREHOUSING CONCEPTS” !!!  


I will try to provide some real life scenario for each concept which will be useful in helping you gather real life knowledge of Datawarehousing.