You may or may not have heard about record level lineage. This is a topic that our customers ask about quite frequently, so our vice president of development, Lukas Hermann, decided to write an article where he answers some of the FAQs. Continue reading to find out more about record level lineage and why we don’t have it.
What is record level lineage?
Record level lineage is an approach to data lineage that is similar to data tagging. The idea behind data tagging is that each piece of data being moved or transformed is tagged (labeled) by a transformation engine, which then tracks that label all the way from start to finish. This approach seems great, but it only works well when a transformation engine controls the data’s every move. Good examples are controlled environments like Cloudera or Dremio that focus only on the origin of one specific record.
Record level lineage vs. column level lineage
A feature that MANTA does have, and that is in a way similar to record level lineage, is column level lineage. What exactly is the difference? Let’s look at an example.
Let’s say you have the column full name in your table. In this table, the full name is created by combining the first name and the last name. Imagine that in the full name column you have names like John Snow and Jack Snow. Now, let’s say that the name John Snow came to this table from your own CRM database, but Jack Snow came from a contact database acquired from a third party.
Record level lineage is able to tell you exactly that John Snow came from your CRM and Jack Snow came from the third-party contact database. Column level lineage, like in MANTA, is able to tell you that the column full name consists of data from these two databases: your CRM database and the contact database.
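To make the difference concrete, here is a minimal sketch in Python. The source names `crm` and `contacts`, the `_source` tag, the `mappings` structure, and both helper functions are illustrative assumptions for this example only; they are not MANTA's data model or API.

```python
# Record level: every row carries a tag naming the source it came from.
tagged_rows = [
    {"full_name": "John Snow", "_source": "crm"},
    {"full_name": "Jack Snow", "_source": "contacts"},
]

# Column level: metadata only. It says which source columns feed which
# target column and never mentions an individual name.
mappings = [
    {"target": "customers.full_name",
     "sources": ["crm.person.first_name", "crm.person.last_name"]},
    {"target": "customers.full_name",
     "sources": ["contacts.people.first_name", "contacts.people.last_name"]},
]


def record_level_lineage(rows):
    """Answer 'where did this exact record come from?' by reading the tags."""
    return {row["full_name"]: row["_source"] for row in rows}


def column_level_lineage(mappings, target):
    """Answer 'which databases feed this column?' from the mappings alone."""
    return sorted({
        source.split(".")[0]
        for mapping in mappings
        if mapping["target"] == target
        for source in mapping["sources"]
    })


print(record_level_lineage(tagged_rows))
print(column_level_lineage(mappings, "customers.full_name"))
```

The first function needs the rows themselves. The second one works without ever seeing a single name, which is the situation described in the next section.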
Why we don’t have it
The reason why MANTA does not have record level lineage is that MANTA doesn’t “see” your data; it doesn’t even “see” that you have a John Snow and a Jack Snow in your full name column. It only reads your metadata. That is why MANTA only sees that the table contains data from these databases, and which databases they are.
Now, you might be thinking that the overall idea of the record level lineage approach might not be so bad after all. But keep in mind that if anything happens to the data outside the walls of the transformation engine, the lineage is broken. It is also important to realize that the lineage exists only for transformation logic that has actually been executed. Think about all the exceptions and rules that apply only once every couple of years. You will not see them in your lineage until they are executed, which is not exactly healthy for your data governance, especially if some of those pieces are critical to your organization.
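The point about rarely executed rules can be sketched in a few lines of Python. Everything here is hypothetical: the rule names, the `legacy_ledger` source, and the leap-day condition simply stand in for "a rule that applies only once every couple of years."

```python
from datetime import date

# Declared transformation logic: each rule names the source it reads from.
RULES = [
    {"name": "regular_load", "source": "crm",
     "applies": lambda day: True},
    {"name": "leap_day_correction", "source": "legacy_ledger",
     "applies": lambda day: day.month == 2 and day.day == 29},
]


def run(day, observed):
    """Execute the rules for one day and record the sources actually used."""
    for rule in RULES:
        if rule["applies"](day):
            observed.add(rule["source"])


def declared_sources():
    """Read the sources from the rule definitions without running anything."""
    return {rule["source"] for rule in RULES}


# Lineage built from execution only knows what has run so far.
observed = set()
for day in (date(2023, 1, 1), date(2023, 2, 28), date(2023, 3, 1)):
    run(day, observed)

print(sorted(observed))            # sources seen at run time
print(sorted(declared_sources()))  # sources visible in the definitions
```

Until a February 29 load actually runs, the execution-based view is missing `legacy_ledger`, while the view built from the definitions already includes it.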
Also, tags are formed by assigning additional metadata to the records. If you lose this metadata, you will never be able to form the lineage again. And without actually running the transformation engine, you don’t know how the given record was put together, and therefore don’t know the lineage behind it.
If MANTA wanted to have record level lineage, it would have to start reading your data instead of your metadata, and it would have to have much more information about your environment. This would make the entire process of getting data lineage far more complicated and time-consuming.
We can safely say that we are not planning on having record level lineage as a feature any time soon. On the other hand, we plan on putting more effort into understanding your data transformations. The fact that MANTA only reads your metadata and is only interested in your data transformations, not your actual data, is the reason why MANTA can be automated so well and get data lineage so fast.
And what about conditional lineage?
You might point out that MANTA also has conditional lineage as a feature, and that we must look into the actual data to create it. Well, not quite. We only use the data values that are specifically mentioned in the scripts. You can learn more about conditional lineage in the article: How to Handle Impact Analyses in Complex DWHs with Predicates.
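As a toy illustration of "data mentioned in the scripts," the sketch below pulls literal values out of the `WHERE` clauses of a made-up SQL script. The script, the table names, and the single regular expression are assumptions for this example; a real lineage tool needs a proper SQL parser, and this is not how MANTA processes scripts.

```python
import re

# Hypothetical load script. The literals 'EU' and 'US' are written in the
# script itself, so no table data has to be read to find them.
SCRIPT = """
INSERT INTO dwh.customers (full_name)
SELECT first_name || ' ' || last_name FROM crm.person WHERE region = 'EU';
INSERT INTO dwh.customers (full_name)
SELECT first_name || ' ' || last_name FROM contacts.people WHERE region = 'US';
"""

# Matches: FROM <table> WHERE <column> = '<literal>'
PREDICATE = re.compile(
    r"FROM\s+([\w.]+)\s+WHERE\s+(\w+)\s*=\s*'([^']*)'",
    re.IGNORECASE,
)


def conditional_edges(script):
    """Return one lineage edge per source table, with its script predicate."""
    return [
        {"source": table, "column": column, "value": value}
        for table, column, value in PREDICATE.findall(script)
    ]


for edge in conditional_edges(SCRIPT):
    print(edge)
```

Each edge records under which condition a source contributes to the target column, and the condition comes entirely from the script text.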
So what MANTA does give you is a list of the exact databases that supply data to the given column in your table. For compliance with regulations such as GDPR and with financial or banking regulations, this is completely sufficient. And typically, there are no more than a few databases that supply each column. So the question is: if you really need to know the specific database for each record in your table THAT BADLY, and you can have the databases narrowed down to a few for each column in a couple of hours, wouldn’t it be more efficient to just check those few databases for the specific record yourself?
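Checking a handful of candidate databases for one record is a small job. Here is a minimal sketch using in-memory SQLite databases as stand-ins; the `person` table, its columns, and the `crm` and `contacts` names are assumptions for illustration only.

```python
import sqlite3


def make_db(people):
    """Create an in-memory stand-in for one candidate source database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE person (first_name TEXT, last_name TEXT)")
    conn.executemany("INSERT INTO person VALUES (?, ?)", people)
    return conn


# The short list of candidates that column level lineage narrowed down.
candidates = {
    "crm": make_db([("John", "Snow")]),
    "contacts": make_db([("Jack", "Snow")]),
}


def find_record(candidates, first_name, last_name):
    """Return the names of the candidate databases that contain the record."""
    hits = []
    for name, conn in candidates.items():
        row = conn.execute(
            "SELECT 1 FROM person WHERE first_name = ? AND last_name = ?",
            (first_name, last_name),
        ).fetchone()
        if row is not None:
            hits.append(name)
    return hits


print(find_record(candidates, "Jack", "Snow"))
```

With only a few candidates per column, one parameterized query per database answers the record-level question on demand.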
Do you have any development-related questions for Lukas, or would you like to learn more about how MANTA can solve a specific issue in your company? Don’t hesitate to contact us at firstname.lastname@example.org.