I feel it is important to talk about different approaches to Data Lineage that are used by data governance vendors today. Because when you talk about metadata, you very often think about simple things – tables, columns, reports. But data lineage is more about logic.
It is more about programming code in any form. It can be SQL script, PL/SQL stored procedure, Java program or complex macros in your Excel sheet. It can literally be anything that allows you to somehow move your data from one place to another, transform it, modify it. So, what are your options and how to understand that logic?
Option 1) Ignore it! (aka Data Similarity Lineage)
No, I am not crazy! There are products building lineage information without actually touching your code. They read metadata about tables, columns, reports, etc. They profile data in your tables too. And then they use all that information to create lineage based on similarities.
Tables, columns with similar names or columns with very similar data values, those are examples of such similarities. And if you find a lot of them between two columns, you link them together in your data lineage diagram. And to make it even more cool, vendors usually call it AI (another buzzword I hate very much). There is one great thing about this approach – if you watch data only and not algorithms, you do not care about technologies and there is no big deal if customer uses Teradata, Oracle or MongoDB with Java on top of it.
But on the other hand, this approach is not very accurate, performance impact can be significant (you work with data) and data privacy is at risk (you work with data). There are also a lot of details missing (like transformation logic for example, very often requested by customers) and lineage is limited to the database world ignoring the application part of your environment.
Option 2) Do the “business” lineage manually
This approach usually starts from the top by mapping and documenting the knowledge in people’s heads. Talking to application owners, Data stewards, data integration specialist should give you fair but often contradictory information about data movements in your organization. And if you miss asking someone you simply don’t know about; a piece of the flow is missing! This often results in dangerous situation when you’re having a lineage but unable to use it for real-case scenarios – not only you do not have trust in your data, but on in the lineage either.
Option 3) Do the technical lineage manually
I will go simply to the point here – trying to analyze the technical flows manually is simply destined to fail. With the volume of the code you have, the complexity of it and the rate of change, there’s no way to keep up with it. When you start considering the complexity of the code and especially a need to reverse engineer the existing code this becomes extremely time consuming and sooner or later, such manually managed lineage falls out of sync with the actual data transfers within the environment and you end-up with a feeling of having lineage that you cannot actually trust.
Now, that we know that automation is the key, let’s take a look at less laboring and error prone approaches.
Option 4) Trace it! (aka Data Tagging Lineage)
Do you know the story of Theseus and the Minotaur? Minotaur lives in a labyrinth and so does Ariadne who was in charge of the labyrinth. Ariadne gave Theseus a ball of thread to help him navigate the labyrinth by tracing his path back.And this is a little bit similar approach.
The whole idea is that each piece of data that is being moved or transformed is tagged / labeled by a transformation engine which then tracks that label all its way from start to the end. It is like that Theseus. This approach looks great but it works well only as long as a transformation engine controls every movement of data. A good example is a controlled environment like Cloudera.
But anything happens outside its walls and lineage is broken. It is also important to realize that lineage is there only if transformation logic is executed. But think about all exceptions or rules that apply only once per a couple of years. You will not see them in your lineage till they are executed which is not exactly healthy for your data governance. Especially if some of those pieces are critical to your organization.
Option 5) Control it! (aka Self Lineage)
The whole idea here is that you have all-in-one environment that gives you everything you need – you can define your logic there, track lineage, manage master data and metadata easily, etc. There are several tools like that, especially with the new Big Data / Data Lake hype. If you have a software product of this kind, all happens under its control, every data movement, every change of data. And so, it is easy for such a tool to track the lineage.
But there is the very same issue as in the previous case with Data Tagging. All what happens outside the controlled environment is invisible. Especially when you think of long term manageability, over the time as new needs appear and new tools are acquired to address them, gaps in the lineage start to appear.
Option 6) Decode it! (aka Decoded Lineage)
Ok, so now we know that logic can be ignored, traced thanks to tags and controlled. But all those approaches fall short in most real-life scenarios. Why? Simply because the world is complex, heterogeneous, wild and most importantly – constantly evolves.
But there is still one other way – to read all the logic, to understand it and reverse engineer it. It literally means to understand every programming language used in your organization for data transformations and movements. And by programming language I mean really everything, including graphic or XML based languages used by ETL tools or reports. And that is the challenging part. It is not easy to develop sufficient support for one language, you need tens of them in most cases to cover basics of your environment.
Another challenging issue is when the code is dynamic, which means that you build your expressions on the fly based on program inputs, data in tables, environmental variables, etc. But there are ways how to handle such situations. On the other hand, this approach is the most accurate and complete as every single piece of logic is processed. It also guarantees the most detailed lineage of all.
An earlier version of this article was published on Tomas Kratky’s LinkedIn Pulse.