One question I read and talk about a lot is whether it is necessary at all to use a schema with my data, especially in these NoSQL times. So, is it important to model?
Absolutely! It simply doesn’t matter whether you put the schema on the data before loading it into the data warehouse (schema-on-write) or put a schema on top of it when you read it from your NoSQL database (schema-on-read). Either way, you need a strong understanding of your data.
Sure, there is a difference in the consumption process, error resistance and maintenance effort if a NoSQL database is being used. This comes with a mindset of loading the data first and then thinking about how it all comes together. But I think this is simply wrong.
Because there is still a whole lot of effort involved in loading data in the first place. Also, you will need a schema at some point to interpret the data eventually. If you don’t have a clear understanding of what your data looks like and how different systems are interconnected (think business keys), you will have a hard time providing genuine, complete value to your stakeholders.
This also comes with a small danger. Schemaless data can, obviously, change whilst being consumed into a NoSQL database. The load itself comes at no cost: nothing fails when the structure changes.
BUT. You will have some data transformation processes, reports, maybe business processes that depend on this data. If the schema changes, new data will not be in the expected format. If this goes unnoticed, it can impact the business. Omitted data is usually harder to find when the schema is volatile than with a schema-on-write process that requires those fields to be present, at least filled with nulls. They should be there. Otherwise the schema-on-read process is faulty.
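To make this a bit more concrete, here is a minimal sketch of how a schema-on-read step could notice such changes instead of silently passing them on. The field names, the `EXPECTED_FIELDS` mapping and the `check_record` helper are purely my own illustration and not taken from any particular database or library. A field that is present but null is accepted, a field that is missing altogether is reported.

```python
# Hypothetical expected schema for an illustrative "customer" record.
# Field names and types are assumptions made up for this example.
EXPECTED_FIELDS = {"customer_id": str, "name": str, "signup_date": str}


def check_record(record, expected=EXPECTED_FIELDS):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field, expected_type in expected.items():
        if field not in record:
            # The field was omitted entirely, not even a null was sent.
            problems.append(f"missing field: {field}")
        elif record[field] is not None and not isinstance(record[field], expected_type):
            # The field is there, but its type has drifted.
            problems.append(
                f"unexpected type for {field}: {type(record[field]).__name__}"
            )
    for field in record:
        if field not in expected:
            # A new field appeared that nobody has modelled yet.
            problems.append(f"unexpected field: {field}")
    return problems


if __name__ == "__main__":
    incoming = [
        {"customer_id": "C1", "name": "Ada", "signup_date": None},
        {"customer_id": "C2", "name": "Grace"},
    ]
    for record in incoming:
        print(record.get("customer_id"), check_record(record))
```

Whether you reject, quarantine or just log such records is a design choice, but at least the change no longer goes unnoticed.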
“It is (mostly) impossible to get repeatable, auditable metrics, KPIs, dashboard, or reports that bring value to the business without understanding the semantics of the data – which means you at least need a conceptual or logical model.”
Kent Graziano
Kent is also a strong advocate of this paradigm, as you can see on his blog. I strongly encourage you to read his thoughts on this, since he has been in this game for a long time.
To sum things up, I absolutely advise modelling the data you receive and having a very clear understanding of what this data represents and how to interpret it. This also goes for data lakes. Make sure you understand the data in them; otherwise the lake will become empty very fast.