I will tell you one thing — I am tired of GDPR buzz. Don’t get me wrong, I value privacy and data protection very much, but I hate the way almost every vendor uses it to sell their goods and services, so much that the original idea is almost lost.
It is similar to BCBS and other past data-oriented regulations. Consulting companies, legal firms, data governance/security/metadata vendors — we are all the same: buy our “thing” and you will be ok, or at least safer with us! Every second book out there tells us that every change is an opportunity to improve, to evolve. So, what is the improvement here with GDPR? If I look around, I see a lot of legal work being done, adding tons of words (in very small print, as always) to already long Terms & Conditions. And you know what? I don’t think there is any real improvement in it.
But things are not always so bad. There is also a lot of good stuff going on with one important goal—to better understand and govern data and its lifecycle in a company. And there is one challenging but critical part I want to discuss today—Data Lineage. Put simply, lineage describes how data moves around your organization. You must understand that a customer’s email address or their credit card number is not just in your CRM but is spread all over your company in tens or even hundreds of systems — your ERP, data warehouse, reporting, new data lake with analytics, customer portal, numerous Excel sheets and even external systems. The path of the data you collect can be very complex, and if you think about all the possible ways you can move and transform data in your company, one thing should be clear — your data lineage has to be automated as much as possible.
Different Approaches to Data Lineage
That being said, I feel it is important to talk about different approaches to data lineage that are used by data governance vendors today. Because when you talk about metadata, you very often think about simple things — tables, columns, reports. But data lineage is more about logic — programming code in any form. It can be an SQL script, PL/SQL stored procedure, Java program or complex macro in your Excel sheet. It can literally be anything that somehow moves your data from one place to another, transforms it, modifies it. So, what are your options for understanding that logic?
This article is based on a presentation by Jan Ulrych at the DGIQ 2018 Conference.
Option 1: Ignore it! (aka data similarity lineage)
No, I am not crazy! There are products that build lineage information without actually touching your code. They read metadata about tables, columns, reports, etc. They profile the data in your tables too. And then they use all that information to create lineage based on similarities — similar table names, similar column names, columns with very similar data values. If you find enough of these similarities between two columns, you link them together in your data lineage diagram. And to make it even cooler, vendors usually call it AI (another buzzword I really hate). There is one great thing about this approach — because you watch data only, and not algorithms, you do not worry about technologies, and it is no big deal whether the customer uses Teradata, Oracle or MongoDB with Java on top of it. But on the other hand, this approach is not very accurate, the performance impact can be significant (you work with data) and data privacy is at risk (you work with data). A lot of details are missing too (like transformation logic, which is very often requested by customers), and the lineage is limited to the database world, ignoring the application part of your environment.
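To make the idea concrete, here is a toy sketch of similarity-based matching. All names, weights and the threshold are made up for illustration — real products use far richer profiling — but the core guessing game is the same: score name similarity plus data-value overlap, and draw a lineage edge when the score is high enough.

```python
from difflib import SequenceMatcher

def similarity_link(col_a, col_b, values_a, values_b,
                    name_weight=0.5, data_weight=0.5, threshold=0.6):
    """Guess a lineage link between two columns from name and data similarity.

    Weights and threshold are illustrative, not taken from any real product.
    """
    # How alike are the column names?
    name_score = SequenceMatcher(None, col_a.lower(), col_b.lower()).ratio()
    # How much do the sampled values overlap (Jaccard similarity)?
    set_a, set_b = set(values_a), set(values_b)
    data_score = len(set_a & set_b) / max(len(set_a | set_b), 1)
    score = name_weight * name_score + data_weight * data_score
    return score >= threshold, round(score, 2)

# Two columns that look alike by name and share most of their values:
linked, score = similarity_link(
    "cust_email", "customer_email",
    ["a@x.com", "b@y.com", "c@z.com"],
    ["a@x.com", "b@y.com", "d@w.com"],
)
# linked is True — plausible, but still only a guess, not proof of a data flow.
```

Note what the sketch also demonstrates: to compute `data_score` you must read actual values, which is exactly where the performance and privacy concerns come from.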
Option 2: Do the “business” lineage manually
This approach usually starts from the top by mapping and documenting the knowledge in people’s heads. Talking to application owners, data stewards and data integration specialists should give you fair but often contradictory information about the movement of data in your organization. And if you miss talking to someone you simply don’t know about, a piece of the flow is missing! This often results in a dangerous situation where you have lineage but are unable to use it for real-world scenarios — not only can you not trust your data, you cannot trust the lineage either.
Jan Ulrych presenting at DGIQ 2018.
Option 3: Do the technical lineage manually
I will get straight to the point here — trying to analyze technical flows manually is simply destined to fail. With the volume of code you have, its complexity and the rate of change, there is no way to keep up. Once you factor in the need to reverse engineer existing code, this becomes extremely time consuming. Sooner or later such manually managed lineage falls out of sync with the actual data transfers in the environment, and you end up with the same feeling as before — lineage you cannot actually trust.
Now that we know that automation is key, let’s take a look at some less labor-intensive and error-prone approaches.
Option 4: Trace it! (aka data tagging lineage)
Do you know the story of Theseus and the Minotaur? The Minotaur lives in a labyrinth, and Ariadne, who knows its secrets, gives Theseus a ball of thread so he can retrace his path and find his way back out.
And this approach is a bit similar. The whole idea is that each piece of data that is moved or transformed is tagged/labeled by a transformation engine, which then tracks that label the whole way from start to finish — just like Theseus’ thread. This approach looks great, but it only works well as long as the transformation engine controls the data’s every movement. A good example is a controlled environment like Cloudera. If anything happens outside its walls, the lineage is broken. It is also important to realize that the lineage only exists if the transformation logic is actually executed. But think about all the exceptions and rules that apply only once every couple of years. You will not see them in your lineage until they are executed. That is not exactly healthy for your data governance, especially if some of those pieces are critical to your organization.
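A minimal sketch of the tagging idea, with all class and source names invented for illustration: the engine attaches labels to every value it loads, and every transformation it runs carries the input labels forward. Crucially, lineage is recorded only for steps that actually execute.

```python
class TaggedValue:
    """A value carrying the tags (lineage labels) of everything it came from."""
    def __init__(self, value, tags):
        self.value = value
        self.tags = set(tags)

class Engine:
    """Toy transformation engine: it records lineage only for steps it executes."""
    def __init__(self):
        self.lineage = []  # (step_name, sorted input tags)

    def load(self, value, source):
        # Tag each value at the moment it enters the controlled environment.
        return TaggedValue(value, {source})

    def transform(self, name, fn, *inputs):
        # Union the input tags so the result remembers its full ancestry.
        in_tags = set().union(*(i.tags for i in inputs))
        out = TaggedValue(fn(*(i.value for i in inputs)), in_tags | {name})
        self.lineage.append((name, sorted(in_tags)))
        return out

engine = Engine()
a = engine.load(10, "crm.customers")
b = engine.load(32, "erp.orders")
total = engine.transform("sum_revenue", lambda x, y: x + y, a, b)
# total.tags now leads back to both sources — but a branch that never
# runs (a rare year-end exception, say) leaves no trace in engine.lineage.
```

The last comment is the key limitation from the text: the thread only exists where Theseus actually walked.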
Option 5: Control it! (aka self-lineage)
The whole idea here is that you have an all-in-one environment that gives you everything you need—you can define your logic there, track lineage, manage master data and metadata easily, etc. There are several tools like this, especially with the new big data/data lake hype. If you have a software product of this kind, everything happens under its control — every data movement, every change in data. And so, it is easy for such a tool to track lineage. But here you have the very same issue as in the previous case with data tagging. Everything that happens outside the controlled environment is invisible, which matters especially when you consider long-term manageability. Over time, as new needs appear and new tools are acquired to address them, gaps in the lineage start to appear.
Option 6: Decode it! (aka decoded lineage)
Ok, so now we know that logic can be ignored, traced with tags and controlled. But all those approaches fall short in most real-life scenarios. Why? Simply because the world is complex, heterogeneous, wild and, most importantly, constantly evolving. But there is still another way — to read all the logic, to understand it and to reverse engineer it. That literally means understanding every programming language used in your organization for data transformations and movements. And by programming language I mean really everything, including the graphical and XML-based languages used by ETL tools and reports. And that is the challenging part. It is not easy to develop sufficient support for one language, let alone the tens of them you need in most cases to cover the basics of your environment. Another challenge is dynamic code, where expressions are built on the fly from program inputs, data in tables, environment variables, etc. But there are ways to handle such situations. On the other hand, this approach is the most accurate and complete, as every single piece of logic is processed. It also guarantees the most detailed lineage of all.
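To give a flavor of what “decoding” means, here is a deliberately minimal sketch that scans SQL text for `INSERT ... SELECT` statements and extracts table-level lineage edges. A real decoded-lineage tool builds a full parser per language and dialect and resolves transformation logic down to column level; this regex handles only the simplest case, and all table names are invented.

```python
import re

# Match "INSERT INTO <target> ... FROM <source>" across lines.
# Real tools parse the full SQL grammar; a regex like this is only a toy.
EDGE_RE = re.compile(
    r"INSERT\s+INTO\s+(\w+(?:\.\w+)?).*?FROM\s+(\w+(?:\.\w+)?)",
    re.IGNORECASE | re.DOTALL,
)

def decode_lineage(sql_script):
    """Return (source_table, target_table) edges found in the script."""
    return [(src, tgt) for tgt, src in EDGE_RE.findall(sql_script)]

script = """
INSERT INTO dwh.customer_dim
SELECT id, email FROM crm.customers;

INSERT INTO mart.revenue
SELECT customer_id, SUM(amount) FROM dwh.orders GROUP BY customer_id;
"""
edges = decode_lineage(script)
# [('crm.customers', 'dwh.customer_dim'), ('dwh.orders', 'mart.revenue')]
```

Even this tiny example hints at why the approach is hard to scale: joins, subqueries, views, stored procedures and dynamic SQL each break the naive pattern, and that is just one language of the tens a typical enterprise runs.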
And that’s it. This was not meant to be a scientific article, but I wanted to show you the pros and cons of several popular data lineage approaches. Which leads me back to my first GDPR paragraph. I see enterprises investing a lot of money in data governance solutions with insufficient data lineage capabilities, offering tricks like data similarity, data tagging and even self-lineage. But that is just guesswork, nothing more — guesswork with a lot of issues and a lot of manual labor to correct the lineage.
So, I am asking you once again — is guessing good enough for your GDPR project?
This article was also published on Tomas Kratky’s LinkedIn Pulse.