Have you ever worked with an internal or external customer who said they just needed a simple Excel file turned into a report? Then, when investigating the process, you found out there were many undiscovered steps in it? And when you told them the implementation might take a few days, that in turn didn't make them happy. Could this have been avoided?
I mean, I get it. We live in very dynamic times. The pressure is on. Organizations need to adapt, quickly test new products, and create new value chains to stay in the market and compete with the other players.
Still, once a new product comes up, a few adaptations have to be made after the initial chaotic phase: a bit of documentation, alignment with general company procedures and processes, and problem management.
The reality, however, is quite different:
- A new product gets a proof of concept
- The proof of concept is transformed into something "live"
- All done, on to the next phase – no one wants to be "unproductive" (with documentation etc.)
- Months later, an investigation starts to understand the completely undocumented process, which of course is business critical
My point is: eventually, some key people might leave the organization. Or there might be a change in infrastructure, and no one is aware that a specific process depends on that infrastructure.
This also affects Business Intelligence processes in a big way. How many times did you get some data without knowing
- where the data is generated – there are only new files or updates in a database
- whether there are any safeguards if the data gets corrupted or the delivering process has a problem
- who integrated several new fields and what those business rules look like
- who owns the script / ETL process / stored procedure
- how long the processing takes and how "old" the data is at the time it is read
- when the processing takes place and whether it will execute again if it fails
Now, this is not primarily about the data itself, but about the process of generating the data – in other words, metadata about data-generating jobs, especially pre-existing legacy jobs. I bet you've had your fair share.
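That job metadata can be as simple as one structured record per job. A minimal sketch in Python (all names, fields, and values are hypothetical) that captures the answers to the questions above:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class JobMetadata:
    name: str
    owner: str                 # who owns the script / ETL process
    source: str                # where the data is generated
    schedule: str              # when the processing takes place (cron syntax)
    retries_on_failure: bool   # will it execute again if it fails?
    max_data_age_hours: int    # how "old" the data may be when it is read
    business_rules: List[str] = field(default_factory=list)

# Hypothetical entry for a legacy export job
job = JobMetadata(
    name="daily_sales_export",
    owner="finance-team",
    source="SAP export to /data/inbound/sales",
    schedule="0 3 * * *",
    retries_on_failure=True,
    max_data_age_hours=24,
    business_rules=["net revenue = gross revenue - returns"],
)
print(job.owner)  # -> finance-team
```

Even a record this small would answer most of the questions in the list above without any detective work.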
Clearing those questions up can take a lot of time on its own. And remember: not a single step of this involves processing the data into a data warehouse, into Hadoop, or wherever else it might live – this is just to "get to know" the data. I frequently call this a lack of Data Awareness.
It is not as simple as management might think to just take data and put it somewhere. If I am part of a Business Intelligence department, I have many other questions in mind first:
- How do I make sure the data is loaded consistently?
- Where does the process run, at exactly what time, and how often?
- What happens if an error occurs?
- Who will implement changes? This is especially important for old, abandoned scripts that are still used frequently (usually someone then tells you: Don't touch this file!)
- And finally: can I make sure that all source data is processed correctly? Meaning: is this "just" a simple ETL job, or is there a lot of business logic happening?
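The first of those questions – is the data actually being loaded? – can at least partly be automated. A minimal sketch, assuming the delivering process drops a file at a known path (the path and the age threshold are illustrative):

```python
import datetime as dt
from pathlib import Path
from typing import Optional

def is_data_fresh(path: Path, max_age_hours: int,
                  now: Optional[dt.datetime] = None) -> bool:
    """Return True if the delivered file exists and is newer than max_age_hours."""
    if not path.exists():
        return False  # the delivering process never ran, or failed silently
    now = now or dt.datetime.now()
    modified = dt.datetime.fromtimestamp(path.stat().st_mtime)
    return (now - modified) <= dt.timedelta(hours=max_age_hours)

# A stale or missing file is exactly the case to escalate, not to debug quietly.
if not is_data_fresh(Path("/data/inbound/sales/latest.csv"), max_age_hours=24):
    print("sales feed is stale or missing – escalate to the process owner")
```

A check like this does not replace knowing the process, but it turns "no one noticed for months" into "we noticed the next morning".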
Only when the complete process is clear – some people call this the lineage of the data – would I investigate further whether there are any requirements regarding the storage and reporting of the data.
But without knowing whether and when a process delivers data, it can become a hassle very quickly if something goes wrong far upstream of the data warehouse. Only with good and consistent metadata about the data can this be avoided. Also, if something turns out not to be handled very well, I can address it, or at least escalate it to raise awareness (here we go again: "Data Awareness") of broken steps in the data lineage.
A big part of this, especially in smaller organizations, is a single agreed-upon place for this documentation. I have used intranet pages, internal wiki pages, dedicated folders for documents describing the data… it comes down to your own creativity in how the metadata is stored.
With bigger budgets, there are powerful applications available that can do a lot of infrastructural analysis as well as provide insights into database schemas – depending on the use case, this can be a huge advantage when developing data lineage, reports, and requirement analyses. But in my experience, it is first of all about centralizing knowledge of the data flows in the organization. This can even mean there is nothing for the Business Intelligence department to do but write down a process that is gathering data.
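Such a central point of documentation does not need to be fancy. A sketch of a tiny machine-readable catalog – every entry here is hypothetical – that at least answers "which data flows where, and who owns it":

```python
# Hypothetical central catalog: one entry per data flow, kept in one agreed place
CATALOG = {
    "daily_sales_export": {
        "owner": "finance-team",
        "source": "SAP",
        "target": "dwh.sales_daily",
        "schedule": "daily at 03:00",
        "contact_on_failure": "bi-team@example.com",
    },
}

def describe(flow: str) -> str:
    """One line that tells you which data flows where – or that no one knows."""
    entry = CATALOG.get(flow)
    if entry is None:
        return f"{flow}: undocumented – investigate before touching anything"
    return (f"{flow}: {entry['source']} -> {entry['target']}, "
            f"owned by {entry['owner']}, runs {entry['schedule']}")

print(describe("daily_sales_export"))
```

Whether this lives in a wiki table, a YAML file, or a proper catalog tool matters far less than the fact that it exists in exactly one place.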
Sometimes it is more helpful to know which data flows where than to try to centralize all jobs and rebuild them just for the sake of it. I try to focus on delivering value and using existing resources, not on generating more work. This helps the team and the customer more quickly, resulting in sprint-length turnaround to build a report – especially if many of the data-processing jobs in the organization are already known.
Business Intelligence can sometimes be a bit like being a detective. A detective with quality reports, though.