If you use Kimballs Star Schema and refer to it as your Data Warehouse, you should read on.
I know, this is a bold claim to make. And everyone who has some experience in Data Warehousing might be surprised in what I mean. But sometimes, your Data Warehouse is not a Data Warehouse. Why?
Data Warehouse Definition
When talking about Data Warehouses, there is no common this-explains-all definition available. In fact, when I was doing my background research for this article, I found several different commonly available definitions of Data Warehouses.
Often, a database, which is modeled with Kimballs Star Schema, is referred to as a Data Warehouse. A common misconception in the Business Intelligence world is that a Star Schema is a Data Warehouse – it is not.
It is important to keep the purpose of a Data Warehouse in mind. A Data Warehouse should be a flexible, integrated representation of relevant data in the business:
- Flexible – The Data Warehouse must be able to adapt changing business requirements fast and be able to consume new data sources quickly.
- Integrated – It should be able to load the data from several source systems and connect them in the Data Warehouse so it is possible to gain insights by using several source systems.
- Representation – This means that usually there will be a copy of the source data available to lessen the performance impact on the source system.
- Relevant data – Not all source system data or tables will be loaded, only the relevant data for the wanted use cases.
So, if you take a Star Schema for the Data Warehouse (bear with me, I’ll explain further in a few moments), I say a Star Schema is not a Data Warehouse. It will not be able to be flexible enough down the road. Sure, a Star Schema is always good for the so much needed quick wins in the beginning of a Business Intelligence project. But don’t be fooled – in the long run it will become very hard to test, to maintain and to expand.
Bill Inmons („Father of data warehousing“) definition goes a little further:
A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management’s decision-making process.Bill Inmon
Especially the parts nonvolatile and decision-making process are important from this definition.
But before I continue, a little more on Architecture.
Data Warehouse Architecture
So, if a Data Warehouse is not simply a Star Schema, what is a Data Warehouse then?
Usually Data Warehouses are split up in several layers which I will describe very roughly (visit Martin Ouellets blog for a more in-depth explanation of those layers):
- Staging – This layer connects the Data Warehouse (which might be a dedicated server machine) to the source systems. It has access to databases, files, network folders etc. pp. The data is being loaded temporarily into staging tables most of the time. No pre-aggregation or pre-calculation takes place – the data gets consumed as it is. Usually only being filled when data is extracted.
- Persistance – In this layer the actual representation of the source data is stored into. There are several ways to store the source data, which I will explain one step further down.
- Presentation – This is the most interesting layer for all users which want to access the data, also known as Data Mart. But this is not a direct access to the persistance layer. Instead, the data from the persistance layer is being formed into a – you might have guessed it already – Star Schema for easy access. Now take the tool of your choice and dig into the data. Also, only here is the place to set those special calculations Marketing wanted. Because we can always regenerate the presentation layer, any errors or mistakes can be corrected. Furthermore, if Sales and Marketing need different calculation rules, both can get their own Data Mart with their own rules. Nonetheless, the data in the persistence layer are unchanged and still represent the data.
The last point is normally controversial up to the point that data is not comparable. But this is more a topic for Data Governance.
As usual, the process to get the data can get very long and complex. Also, to generate the Data Mart correctly can take some time. The only thing the user sees are the finished reports or dashboards.
That feels like a big iceberg. The data processing, data consolidation is all under water and not visible… but the presentation is visible.
And as we all know, icebergs are massively bigger under water than over it.
Because of that the persistent layer of the Data Warehouse should not be a Star Schema. In fact, I would go so far and say that a Data Warehouse which consists only of a Star Schema is not even a Data Warehouse – it is just a representation of some business data and its rules, or more exact: a Data Mart.
Even if there are two layers (persistent and presentation layer) it makes little sense to create one Star Schema for the persistent layer and one for the presentation layer. Why? Because the preparation part will become more complex the more rules apply to the second schema.
So, to model the persistent layer of the Data Warehouse, other methods are better suited. The few main ones usually to be found are:
- Third Normal Form – 3NF
- Data Vault
- Anchor Modelling
Each and everyone of those do have their own ups and downs. In the end it should not only be important which model is being used but also how patient the business is (which it is usually not). It is not the intention of this article to explain those modeling techniques.
I hope this clears up a few misconceptions I read about very often.