A while ago I wrote about using MySQL as a database for Data Vault 2.0. Now, after some time passed and the dataset grew, it is time for a new evaluation.
The main difference between the first blog post and now is that the dataset grew from a few hundred megabytes to about 20 GB of data. We had some significant slow-downs on our old database development server, which provided only around 4 GB of RAM for MySQL. Since 20 GB is bigger than 4 GB (hehe), there was a lot of reading from the hard drive whenever database pages were needed. Michael Bouvy has a nice explanation of why and how this happens.
Especially for statements that joined a large number of Hubs, Links and Satellites, we ran into problems:
- Complex statements would take very long to run (minutes)
- A few newer reports with more complex Data Vault requirements would run forever (and block subsequent processes in the db)
- Since relationships (e.g. Links) are a bit of a problem in Data Vault 2.0 when the data is not transaction-based, we used a few subselects to fetch only the newest relationships. This killed a few queries that had worked fine without these little tricks.
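To illustrate the kind of subselect meant here, a "newest relationship" lookup might look like the sketch below. The table and column names are made up for the example and not our actual schema:

```sql
-- Hypothetical link table plus link satellite; we only want the
-- most recent satellite row per link hash key.
SELECT l.customer_hk,
       l.order_hk
FROM link_customer_order l
JOIN sat_customer_order s
  ON s.customer_order_hk = l.customer_order_hk
WHERE s.load_date = (
    -- correlated subselect: newest satellite row for this link
    SELECT MAX(s2.load_date)
    FROM sat_customer_order s2
    WHERE s2.customer_order_hk = s.customer_order_hk
);
```

The correlated subselect runs once per candidate row, which is exactly why queries of this shape fall off a cliff once the working set no longer fits in the buffer pool.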
The first two points would happen to any dataset bigger than the available RAM (or buffer pool, in MySQL terms); the third point is a topic of its own. So we decided to move our development database to a more capable server. This server provides enough RAM for our dataset to be loaded into memory completely, and the InnoDB buffer pool size was set to around 80% of the total RAM.
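As a rough sketch, the relevant my.cnf setting looks like this (the size shown is an example for a 32 GB machine, not our exact value):

```ini
[mysqld]
# roughly 80% of total RAM, large enough to hold the full ~20 GB dataset
innodb_buffer_pool_size = 26G
```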
This made a big performance difference, which should be no surprise. The reports would load in seconds and the complex statements would finish much faster. Once the data was in the buffer pool (and perhaps the query cache), things sped up again. This made it fun again to work with all the data!
On the subject of End Dating Satellites
End dating is a matter of its own.
At first, we had our MySQL backend on the same server as our ETL jobs. The ETL jobs would load every row, issue an UPDATE statement (altered slightly from the statement in the book so that it runs with good performance on MySQL) and send it back to the server.
This, of course, ran very fast.
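For reference, an end-dating UPDATE of the kind described might look like the following sketch. The table and column names are illustrative; the derived-table detour is the sort of alteration MySQL needs, since it doesn't let you update a table that a plain subquery in the same statement also reads from:

```sql
-- Close each open satellite row by setting its end date to the
-- load date of the next row for the same hash key.
UPDATE sat_customer s
JOIN (
    SELECT s1.customer_hk,
           s1.load_date,
           -- next load date per hash key, found in a derived table
           (SELECT MIN(s2.load_date)
            FROM sat_customer s2
            WHERE s2.customer_hk = s1.customer_hk
              AND s2.load_date > s1.load_date) AS next_load_date
    FROM sat_customer s1
) nxt
  ON nxt.customer_hk = s.customer_hk
 AND nxt.load_date   = s.load_date
SET s.load_end_date = nxt.next_load_date
WHERE s.load_end_date IS NULL
  AND nxt.next_load_date IS NOT NULL;
```

Because the derived table gets materialized, MySQL accepts the self-reference and can execute the whole thing in one pass.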
Then we switched databases. The new database was now connected over Ethernet and no longer available locally.
End dating times for several big entities went from around 3 to 5 minutes to 40 to 60 minutes. Obviously the bandwidth of the server connection was not beefy enough to push all this text around any faster. Even if it were, it would still be a bad idea to load all the data just for end dating satellites.
Our solution: Creating stored procedures.
We had to create a stored procedure for every satellite, since MySQL didn't support dynamic table names in stored procedures at the time we checked. Then we created a "general purpose" ETL job which handles the creation and execution of the stored procedure, and we simply call it from our load jobs.
So this stored procedure essentially loads a DISTINCT list of entity hash keys and loops the UPDATE statement through them.
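A simplified sketch of such a procedure is shown below. The satellite name and columns are hypothetical, and the inner UPDATE is reduced for readability (it assumes at most one newer row per hash key in a delta load); it only illustrates the cursor-over-DISTINCT-hash-keys pattern:

```sql
-- Hypothetical end-dating procedure for one satellite.
-- MySQL needs one procedure per table: no dynamic table names.
DELIMITER $$
CREATE PROCEDURE end_date_sat_customer()
BEGIN
    DECLARE done INT DEFAULT 0;
    DECLARE hk CHAR(32);  -- MD5-style hash key, as an assumption
    -- DISTINCT list of hash keys that still have open rows
    DECLARE cur CURSOR FOR
        SELECT DISTINCT customer_hk
        FROM sat_customer
        WHERE load_end_date IS NULL;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

    OPEN cur;
    read_loop: LOOP
        FETCH cur INTO hk;
        IF done THEN
            LEAVE read_loop;
        END IF;
        -- end-date the open row for this hash key using a self-join
        UPDATE sat_customer s
        JOIN sat_customer newer
          ON newer.customer_hk = s.customer_hk
         AND newer.load_date > s.load_date
        SET s.load_end_date = newer.load_date
        WHERE s.customer_hk = hk
          AND s.load_end_date IS NULL;
    END LOOP;
    CLOSE cur;
END$$
DELIMITER ;
```

The load job then only has to issue `CALL end_date_sat_customer();`, so no satellite data ever travels over the wire.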
With this approach, end dating about a million rows with a bit of delta takes about 3 to 5 minutes on our current setup. That is fine, considering we are now back to our "normal" times but utilize the database server a lot more. For small entities we still use a different approach that downloads the statements, but we could eventually switch there as well. For a few big entities, we also switched from ETL loading to ELT loading. This is a bit complicated in Talend, but possible nonetheless. With the old setup this wasn't much of an issue, but now ELT performs much better.
Using both of these approaches (a big enough InnoDB buffer pool, and end dating large datasets inside the database with stored procedures) we get pretty good performance for loading and using Data Vault 2.0 data in our MySQL development environment. We will eventually experiment some more with the MySQL settings; there might be some more performance to gain with custom tuning.