To implement Data Vault and its modelling technique, a database is required. Can MySQL cut it?
How to implement the different layers of your data warehouse is a big and important question. As I wrote before, a Star Schema is not a data warehouse. I assume that decision has been made and the modelling technique of your choice is Data Vault 2.0. Is MySQL a good all-round candidate?
MySQL ships with two main storage engines, MyISAM and InnoDB. Both have their advantages and disadvantages, so it is important to make up your mind first. This can become a complex topic and also a matter of personal preference, but I’d pick InnoDB because of its transaction support.
Now, there are three roles the database has to fill in the Data Vault 2.0 methodology:
- Staging area – very fast consumption speed needed
- Persistent layer – should be able to consume data from the staging layer and insert it quickly, and should be able to handle large data volumes. Stakeholders will not use this layer directly, so lookup performance is usually not an issue
- Data Mart (Data Vault 2.0-speak: Information Mart) – should be able to provide fast performance over a star schema
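To make the persistent layer more concrete, here is a minimal sketch of the core Data Vault 2.0 structures, a hub and a satellite keyed by a hash of the business key. All table and column names (`hub_customer`, `sat_customer`, `customer_hk`) are made up for this illustration, and sqlite3 stands in for MySQL/InnoDB so the snippet is self-contained:

```python
# Illustrative sketch: a minimal Data Vault 2.0 hub + satellite.
# sqlite3 is used here as a stand-in for MySQL/InnoDB; the schema
# and names are assumptions for this example, not from the article.
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_customer (
    customer_hk   TEXT PRIMARY KEY,   -- hash key (Data Vault 2.0 style)
    customer_bk   TEXT NOT NULL,      -- business key
    load_dts      TEXT NOT NULL,      -- load timestamp
    record_source TEXT NOT NULL
);
CREATE TABLE sat_customer (
    customer_hk   TEXT NOT NULL,      -- points back to the hub
    load_dts      TEXT NOT NULL,
    name          TEXT,               -- descriptive attribute
    record_source TEXT NOT NULL,
    PRIMARY KEY (customer_hk, load_dts)
);
""")

def hash_key(business_key: str) -> str:
    """Deterministic hash key derived from the business key."""
    return hashlib.md5(business_key.encode("utf-8")).hexdigest()

hk = hash_key("CUST-1001")
conn.execute("INSERT INTO hub_customer VALUES (?, ?, ?, ?)",
             (hk, "CUST-1001", "2024-01-01 00:00:00", "crm"))
conn.execute("INSERT INTO sat_customer VALUES (?, ?, ?, ?)",
             (hk, "2024-01-01 00:00:00", "Alice", "crm"))
```

Because the hash key is computed deterministically from the business key, hubs, links and satellites can be loaded in parallel without lookups against each other, which is one reason insert speed matters more than lookup speed in this layer.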
First of all, in my experience, MySQL and Data Vault 2.0 fit together very well for getting things up and running. We tweaked the MySQL settings a lot to be able to process bigger transactions, to keep transactions open longer and to use more RAM for the InnoDB buffer pool where possible. This included (a loose write-up of a few settings; sometimes several have to be changed together):
- innodb_buffer_pool_size: set this to a generous share of RAM, but leave enough headroom to avoid swapping if you are on Linux
- skip-name-resolve: disables DNS lookups for client connections, which can speed up connection handling under certain conditions
- slow_query_log: VERY important for finding performance issues
- Also: server monitoring, especially of RAM, CPU, swap and disk space. This is very helpful for understanding the effect of settings changes
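Put together, such a tuning pass might look like the following my.cnf fragment. The values are assumptions for illustration and depend entirely on your hardware; they are not our actual production settings:

```ini
# Illustrative my.cnf fragment -- values are assumptions, tune for your box
[mysqld]
# InnoDB buffer pool: a large share of RAM, but leave OS headroom (avoid swap)
innodb_buffer_pool_size = 8G
# Skip DNS lookups on client connections
skip-name-resolve
# Log statements slower than 2 seconds for performance analysis
slow_query_log      = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time     = 2
```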
What helped a lot in finding the right values was MySQL Tuner, a handy little Perl script.
All of this applies to the staging area and the persistent layer. But MySQL does not seem that strong for star schemas – Oracle’s and Microsoft’s solutions have far more optimization in that field, though they also cost quite a bit more than MySQL.
I’d say MySQL is capable of providing enough beef for Data Vault 2.0; performance is very good in our case, especially with the new techniques introduced in Data Vault 2.0. I should add that I have not yet worked on a clustered, master-slave big data MySQL infrastructure, so I cannot answer this question for the very largest data volumes. Also, the Data Mart might need a different database later on, as load and data volume on the MySQL database grow.
Some Data Vault 2.0 techniques, like end-dating satellites, might need some extra attention and fine-tuning, but they are still possible and manageable in my opinion.
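To illustrate what end-dating a satellite involves, here is a minimal sketch of the pattern: when a new row arrives for a hash key, the previously open row is closed by setting its end date. As before, sqlite3 stands in for MySQL, and the table and column names are assumptions for this example:

```python
# Sketch of satellite end-dating (Data Vault 2.0 style history tracking).
# sqlite3 stands in for MySQL; schema and names are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE sat_customer (
    customer_hk  TEXT NOT NULL,
    load_dts     TEXT NOT NULL,
    load_end_dts TEXT,               -- NULL marks the current row
    name         TEXT,
    PRIMARY KEY (customer_hk, load_dts)
)""")

def insert_with_end_dating(conn, hk, load_dts, name):
    # Close the currently open row for this hash key ...
    conn.execute(
        "UPDATE sat_customer SET load_end_dts = ? "
        "WHERE customer_hk = ? AND load_end_dts IS NULL",
        (load_dts, hk))
    # ... then insert the new current row with an open end date.
    conn.execute(
        "INSERT INTO sat_customer VALUES (?, ?, NULL, ?)",
        (hk, load_dts, name))

insert_with_end_dating(conn, "abc123", "2024-01-01", "Alice")
insert_with_end_dating(conn, "abc123", "2024-02-01", "Alice B.")

# Only the newest row per hash key is still open (load_end_dts IS NULL)
current = conn.execute(
    "SELECT name FROM sat_customer WHERE load_end_dts IS NULL").fetchall()
```

The UPDATE in this pattern is exactly where the fine-tuning comes in: on large satellites it touches existing rows on every load, so transaction size and buffer pool settings matter much more than for the insert-only hub and link loads.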