Understand Your Data: The Ultimate Guide to Data Lineage

What is data lineage?

The amount of data that companies store and process has skyrocketed over the past few years. Data is becoming more and more complicated and dynamic, and handling it properly and efficiently can sometimes seem impossible. Yes, welcome to the Age of Big Data.

Even though understanding data (where it comes from and how it is linked together) can help companies in many ways, there is still a significant number of enterprises that do not have their data lineage under control.

Data lineage shows what source(s) the data comes from, where it is flowing to in the environment, and—last but not least—what happens to it along the way. This can shed light on the role of particular data units in the environment as a whole.

Let’s talk some more about data units. Initially, a data unit is born somewhere in the environment. It can originate from different databases or be the result of various transformations. Data origin is actually the main idea behind the term data provenance, which can be understood as a subset of data lineage. But this is just the beginning of the data’s life cycle. Mostly, the data flows in many directions through an environment that consists of various platforms, and it passes through multiple processes. On its way, the data unit interacts with other data, is transformed, and is used in different reports.

Just imagine how many systems and data sources you have in your organization, how much data processing logic, how many ETL jobs, how many stored procedures, how many lines of programming code, how many reports, how many ad-hoc Excel sheets, etc. It is huge.

Let’s take a look at a specific case. The simplified diagram above shows how the customer acquisition cost was actually calculated. This metric shows us the expenditures associated with convincing the customer to buy our product or service.

Let’s imagine that the final value is the only thing we know. We need to go back and uncover the path that led to this figure. As a result, we will be able to find out what reports or even columns contributed to the final value. In this case, the customer acquisition cost came from a report containing the number of customers and marketing reports with total expenditures for particular marketing activities. We can also see what databases these reports are stored in.

What is not obvious at first sight is that there are procedures hiding under each line that connects two objects. These procedures make the next step happen. Knowing all this information allows us to analyze the data inside and out.

It is not necessary to always go backward in the lineage. Sometimes the starting point is somewhere in the middle of the path or at its beginning, when you need to make changes to reports or to source databases. In such cases, being able to track the lineage means we can easily predict what the implications of our decisions will be and how they will influence the output.
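To make the example concrete, here is a minimal Python sketch of how such a metric and its lineage record might be derived. It is not taken from any real system; the report names and figures are invented for illustration.

    # A toy calculation of customer acquisition cost (CAC) with a hand-made
    # lineage record. All report names and numbers are hypothetical.
    marketing_reports = {              # total expenditures per marketing activity
        "online_ads_report": 50_000.0,
        "events_report": 30_000.0,
    }
    new_customers = 400                # taken from the customer-count report

    total_marketing_spend = sum(marketing_reports.values())
    customer_acquisition_cost = total_marketing_spend / new_customers

    # The lineage of the final figure: every report that contributed to it,
    # plus the transformation that connects them.
    lineage = {
        "output": "customer_acquisition_cost",
        "inputs": list(marketing_reports) + ["customer_count_report"],
        "transformation": "sum(expenditures) / number_of_new_customers",
    }

    print(customer_acquisition_cost)   # 200.0
    print(lineage["inputs"])           # the reports the figure depends on

Even in this toy form, the record answers the two questions lineage is about: which inputs fed the figure and which transformation connected them.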

The role of metadata

Metadata is basically data about data. It is typically understood as assets and related information. When talking about metadata, we typically divide it into the following three categories.

TECHNICAL METADATA

Technical metadata, aka physical metadata, describes the physical storage of data. It typically covers databases and their schemas, views, tables, and the columns of those views and tables, including details such as column data types, sizes, and descriptions, as well as profiling information such as the most common value patterns, sample values, value frequency charts, percentages of values filled in, and data domains (e.g., a physical column that stores data such as an address, phone number, or e-mail address).

LOGICAL METADATA

Logical metadata defines which details we store about an entity, but it does not define the entity’s business use or its physical representation. This information is generally managed in data modeling tools. It typically covers logical entities such as customers, parties, and addresses, their definitions, what they mean and represent, and links to the physical assets where these entities are actually stored.

BUSINESS METADATA

Business metadata has a specific meaning with regard to a particular business process—for example, a customer may be defined differently for finance, risk, sales, and customer support. The organization needs to define what a customer is and determine whether, and for what reasons, different departments will use different definitions. This needs to be defined, agreed upon, documented, explained, and tracked along with related business terms, processes, etc.

HOW DOES LINEAGE RELATE?

Lineage connects all this information. Metadata without lineage is just a definition that is good to have but hard to use. And vice versa: lineage without metadata is hard to use, because we have information about assets but typically need additional metadata, such as descriptions and the logical and business layers, to really make use of it. That metadata is what makes lineage useful for specific use cases such as impact analyses, root-cause analyses, and compliance, and for specific types of users, presenting information at the appropriate level of granularity: high-level lineage (business lineage) for business-oriented users, logical-level lineage for analysts, and physical lineage for IT.
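As a rough illustration of how the three layers and lineage fit together, here is a minimal Python sketch. The classes, fields, and asset names are hypothetical and do not reflect the schema of any particular tool.

    # Hypothetical structures for technical, logical, and business metadata,
    # tied together by a lineage edge between two physical columns.
    from dataclasses import dataclass

    @dataclass
    class PhysicalColumn:              # technical metadata
        database: str
        table: str
        column: str
        data_type: str

    @dataclass
    class LogicalEntity:               # logical metadata
        name: str
        definition: str
        physical_assets: list          # where the entity is actually stored

    @dataclass
    class BusinessTerm:                # business metadata
        term: str
        owning_process: str
        definition: str

    @dataclass
    class LineageEdge:                 # lineage links the physical assets...
        source: PhysicalColumn
        target: PhysicalColumn
        transformation: str            # ...and records what happens along the way

    src = PhysicalColumn("crm", "contacts", "email", "varchar(255)")
    tgt = PhysicalColumn("dwh", "dim_customer", "email_address", "varchar(255)")
    edge = LineageEdge(src, tgt, "lowercase + trim during the nightly ETL load")

    customer = LogicalEntity("Customer", "A party that buys our products", [src, tgt])
    term = BusinessTerm("Customer", "Sales", "Anyone with at least one paid order")

    print(edge.transformation)         # the detail raw metadata alone cannot give you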

Technical vs. business lineage

The requirements for lineage vary depending on who the final audience is. Technical lineage (also known as physical lineage) is definitely a necessity for technical workers, such as software developers, who are interested in the detailed movements of data and its physical storage. It is also a “must have” for any good data governance solution.

On the other side is business lineage, which lets the user see data lineage from a higher perspective, without technical details like detailed transformations and with simplified descriptions instead. This option is ideal for “decision makers,” since lineage at the technical level usually does not accommodate their needs. Business lineage, by contrast, provides them with information at a level of detail that is a sufficient basis for their decisions.
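The relationship between the two views can be shown with a small, hypothetical sketch: take column-level (technical) edges and roll them up into system-level (business) edges. The column paths below are invented.

    # Collapsing technical, column-level lineage into a business-level view by
    # grouping edges per system. All paths are made up for illustration.
    technical_edges = [
        ("crm.contacts.email",             "dwh.dim_customer.email_address"),
        ("crm.contacts.signup_date",       "dwh.dim_customer.created_at"),
        ("dwh.dim_customer.email_address", "bi.customer_report.email"),
    ]

    def system_of(column_path: str) -> str:
        return column_path.split(".")[0]        # "crm.contacts.email" -> "crm"

    business_edges = {(system_of(s), system_of(t)) for s, t in technical_edges}
    print(sorted(business_edges))               # [('crm', 'dwh'), ('dwh', 'bi')]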


Why is data lineage so important?

"In God we trust. All others must bring data."

W. Edwards Deming, American engineer, statistician, and management consultant

Lots of companies do not worry about data lineage at all. Of course, that is an option. You can simply let it be without trying to discover what’s happening to the data in the environment, but at the end of the day, you won’t know what the path of the data unit is and how it is transformed. In the worst case, you won’t even know how and where it came into being.


And what are the consequences? Not knowing the above-mentioned information makes it impossible to predict what effects your actions could have on the system. And that, in a nutshell, is the answer to the question posed by the heading. Now let’s break it down.

Having a complete understanding of your data, where it comes from, who uses it, and how it is modified, makes the data trustworthy. With that said, let’s take a closer look at these questions:

WHAT DATA DO WE STORE?


This is the first thing we definitely need to know. It might sound funny, but especially in data warehouses that have been developed over decades, this can become a real problem. The amount of data is practically endless, and the people who were there at the beginning no longer work for the company.

HOW AND WHERE IS IT USED?


What places is the data flowing to? Which jobs are responsible for these moves? This is also an important point of view. The data may reach places and create relationships we would never think of. This knowledge may help us improve data treatment processes (especially if we are talking about personal data) or just better understand the weight and impact of our decisions on the entire environment, such as what happens if we make changes to the data or remove it completely.

WHO USES THE DATA?


Awareness of who can use particular data is critical for proper management of access rights. A user may have rights to only a few systems, yet data from other systems that the user should not come into contact with could flow into those systems. This is crucial for both regulatory compliance and internal data security.

IS IT NECESSARY TO STORE IT?


This question is closely related to everything that is written in the paragraphs above and should actually be answered by answering all the previous questions.

Knowing what data we store enables us to identify the unused parts that died years ago and are now just occupying space we could free up. Also, time and other resources are being wasted when we process unnecessary data, a cost that nobody wants to see on their budget. Another situation with similar consequences is duplicate data.

Forbes magazine calls this kind of data “dark data” and describes it as:

Information assets that an organization collects, processes and stores in the course of its regular business activity, but generally fails to use for other purposes.


What can you use data lineage for?

"Torture the data, and it will confess to anything."

Ronald Coase, British economist and author

Data lineage can become your best friend when faced with the following situations and resulting time pressure:

Trust in data and understanding it

This is the first “sphere of data lineage’s influence” and also a building block for all the other use cases below. If data lineage is being ignored or mapped inaccurately (which can be caused by many things but basically means using the wrong methods for the wrong mapping jobs), you cannot really trust your reports at all. Only when you have a complete understanding of your data can you really rely on it and make the most of it. And that results in growth in your overall efficiency. The better this part is done, the more effortless everyday work with data becomes.

Compliance

The number of regulations that require data lineage has increased rapidly over the past few years, and we can assume that more are waiting in line. Basel, HIPAA, GDPR, CCAR… just to name a few. All of these have one thing in common—the company’s stakeholders (customers, auditors, employees, control authorities) need to have perfect trust in the data we report. Where is the data coming from? How did it get to its place? We might know the answers, but are we capable of proving it with up-to-date evidence whenever necessary? Or do we need weeks or months to complete a report which, in the end, is not completely reliable?

Root cause analysis

Imagine that you have been working on a project for a long time, but in the end, your labor does not bear the fruit you expected. You need to find out why, but everything seems to be fine as far as your eyes can see. You need to go deeper, under the surface, to find the source of the error, the spoiled root. And this is the challenging part, which often means hours and hours of manual labor if you do not have complete automated data lineage available.

Impact analysis

Properly done impact analyses improve businesses more than management can usually imagine. When done incorrectly they can result in delayed deliveries, low-quality deliveries, the need for emergency fixes, and extra work.

With automated data lineage, it is much easier to quickly identify the impacts of changes throughout the entire environment. As a result, information about changes can easily be propagated to where it is applicable.
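Under the hood, both root cause analysis and impact analysis are walks over the lineage graph: upstream for root causes, downstream for impacts. A minimal sketch, with an invented graph, might look like this:

    # Downstream (impact) traversal of a tiny, made-up lineage graph. Reversing the
    # edges and running the same walk gives the upstream (root-cause) view.
    from collections import defaultdict

    edges = [
        ("src.orders",     "dwh.fact_sales"),
        ("dwh.fact_sales", "bi.revenue_report"),
        ("dwh.fact_sales", "bi.cac_report"),
        ("src.campaigns",  "bi.cac_report"),
    ]

    downstream = defaultdict(list)
    for source, target in edges:
        downstream[source].append(target)

    def impacted_by(asset, graph):
        """Return every asset reachable from `asset` in the lineage graph."""
        seen, stack = set(), [asset]
        while stack:
            for nxt in graph[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    print(impacted_by("src.orders", downstream))
    # {'dwh.fact_sales', 'bi.revenue_report', 'bi.cac_report'} (set order may vary)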

Migrations

Anyone who has ever witnessed a migration project knows how complex a process it is and how much labor is necessary to thoroughly scope the project. The project team needs to ensure that the data is secure, and it is also the right time to get rid of the parts that are not needed anymore in order to avoid migrating worthless data. But this is a very tricky task, as it requires the ability to view all dependencies within the environment prior to migration. This can help the project avoid future complications.

Data consolidation

Nowadays, companies are seeing significant growth in the volume and complexity of their data. Integrating all the data from multiple sources in one place definitely has great benefits, such as saving resources and streamlining IT management, and it is nearly a necessity for companies that are going through mergers or acquisitions. But the process can be quite painful without the right tool or consultancy. What needs to be done properly in the planning phase is estimating how long the whole project will take, what the hardware requirements are, and so on. That is very hard to do without knowing how complex the environment is. And this is where data lineage comes riding in on a white horse.

Self-service enablement

What is often a hitch for data scientists or data analysts is that they have to rely on IT to retrieve the data they need. But as you can imagine, this can result in wasted time, delayed deliveries, or data that is outdated by the time it reaches the person who requested it. With access to the right tool, data scientists and analysts have the power to retrieve the necessary up-to-date information on their own, whenever they want it.


Approaches to data lineage: How to create it and keep it up to date

Now that you know what data lineage is, why it is so important to your company, and what it can be used for, you are probably wondering how you can actually create data lineage.

It is important to talk about the different approaches to data lineage that are used by data governance vendors today because when you talk about metadata, you very often think of simple things—tables, columns, reports. But data lineage is more about logic—programming code in any form. It can be an SQL script, PL/SQL stored procedure, Java program, or complex macro in your Excel sheet. It can be literally anything that somehow moves your data from one place to another, transforms it, or modifies it. So, what are your options for understanding that logic?

[Chart: Similarity lineage 2%, Manual lineage 3%, Data tagging lineage 5%, Self-controlled lineage 10%, Decoded lineage 80%]

Option 1: Similarity lineage

This includes products that build lineage information without actually touching your code. They read metadata about tables, columns, reports, etc. They profile data in your tables too. And then, they use all that information to create lineage based on similarities. Tables or columns with similar names and columns with very similar data values are examples of such similarities. And, if you find a lot of them between two columns, you link them together in your data lineage diagram. Vendors might even call it AI. There is one big advantage to this approach—if you only watch data, and not algorithms, you do not have to worry about technologies and it is no big deal if the customer uses Teradata, Oracle, or MongoDB with Java on top. But this approach is not very accurate. The impact on performance can be significant (you work with data), and data privacy is at risk (you work with data). There are also a lot of details missing (like transformation logic, for example, which is very often requested by customers) and the lineage is limited to the database world, ignoring the application part of your environment.

On the other hand, this approach may be sufficient for some cases involving replicated documents, where reading the logic hidden in your programming code and trying to understand it would actually be less efficient.
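To show the idea (and its fragility), here is a deliberately naive sketch of similarity-based matching. The thresholds, column names, and samples are made up; real products combine many more signals.

    # Guess a lineage link between two columns when their names are similar and
    # their sample values overlap heavily. Purely illustrative thresholds.
    from difflib import SequenceMatcher

    def name_similarity(a: str, b: str) -> float:
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def value_overlap(sample_a, sample_b) -> float:
        a, b = set(sample_a), set(sample_b)
        return len(a & b) / max(len(a | b), 1)

    def guess_link(col_a, col_b, sample_a, sample_b) -> bool:
        return (name_similarity(col_a, col_b) > 0.6
                and value_overlap(sample_a, sample_b) > 0.8)

    src = ("cust_email",     ["a@x.com", "b@y.com", "c@z.com"])
    tgt = ("customer_email", ["a@x.com", "b@y.com", "c@z.com"])
    print(guess_link(src[0], tgt[0], src[1], tgt[1]))   # True -> draw an edge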

Option 2: Manual lineage

MANUAL BUSINESS LINEAGE

This approach usually starts from the top by mapping and documenting the knowledge in people’s heads. Talking to application owners, data stewards, and data integration specialists should give you a fair amount of information, which is often contradictory, about the movement of data in your organization. And if you miss talking to someone you simply don’t know about, a piece of the flow is missing! This often results in a dangerous situation where you have lineage but are unable to use it for real case scenarios. Not only are you unable to trust your data, you cannot trust the lineage either. In the end, you’ll find out you need quality technical lineage to build the business lineage on.

MANUAL TECHNICAL LINEAGE

Trying to analyze technical flows manually is quite challenging. With the volume of code you have, its complexity, and the rate of change, there is no way to make do with just this method; it becomes unsustainable. When you consider the complexity of the code, and especially the need to reverse engineer existing code, the work becomes extremely time-consuming. Sooner or later, such manually managed lineage will fall out of sync with the actual data transfers in the environment, and you will end up with lineage you cannot actually trust.

Despite all these cons, we cannot sideline this approach completely, as this is where we all need to start—to be able to gain insight into what is actually going on across the entire environment, and it is also necessary for tracking manual processes.

Now that we know that automation is key, let’s take a look at some less labor-intensive and error-prone approaches.

Option 3: Data tagging lineage

The idea behind data tagging is that each piece of data that is being moved or transformed is tagged/labeled by a transformation engine, which then tracks that label all the way from start to finish. This approach looks great, but it only works well if there is a transformation engine controlling every movement of the data. A good example is a controlled environment, such as those offered by Cloudera or Hadoop-based tools, which focuses only on the origin of one specific record. Another similar approach to creating data lineage is record-level lineage.
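Conceptually, such a tagging engine might look like the following minimal sketch; the function names and record format are invented for illustration.

    # A toy "transformation engine" that stamps each record with a tag and carries
    # it through every step it controls, so the record's path can be replayed later.
    import uuid

    def ingest(value, source):
        return {"value": value, "tag": str(uuid.uuid4()), "path": [source]}

    def transform(record, step_name, fn):
        # Apply the change and append the step to the record's recorded path.
        return {"value": fn(record["value"]),
                "tag": record["tag"],
                "path": record["path"] + [step_name]}

    rec = ingest(" 42 ", source="csv_upload")
    rec = transform(rec, "strip_whitespace", str.strip)
    rec = transform(rec, "to_int", int)
    print(rec["path"])   # ['csv_upload', 'strip_whitespace', 'to_int']
    # Anything that touches the data outside these functions leaves no trace,
    # which is exactly the limitation discussed below.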

The overall idea of this approach might not be so bad after all, but keep in mind that if anything happens outside its walls, the lineage is broken. It is also important to realize that the lineage is only there if the transformation logic is executed. But think about all the exceptions and rules that apply only once every couple of years. You will not see them in your lineage until they are executed, which is not exactly healthy for your data governance, especially if some of those pieces are critical to your organization.

Option 4: Self lineage

The idea behind this approach is that you have an all-in-one environment that gives you everything you need. You can define your logic there, track lineage, manage master data and metadata easily, etc. There are several tools like this, especially with all the new big data/data lake hype. If you have software of this kind, it controls everything that happens—every data movement, every change in data. Thus, it is easy for such a tool to track lineage. But here you have the very same issue as with data tagging in the previous case: everything that happens outside the controlled environment is invisible, which matters especially when you consider long-term manageability. Over time, as new needs appear and new tools are acquired to address them, gaps in the lineage start to appear.

Option 5: Decoded lineage

So now we know that logic can be ignored, traced with tags, and controlled. But all those approaches fall short in most real-life scenarios. Why? Simply because your data environment is complex, heterogeneous, wild, and most importantly, it is constantly evolving. But there is still another way—to read all the logic, to understand it, and to reverse engineer it. That literally means understanding every programming language used in your organization for data transformations and movements. And by programming language, I mean really everything, including graphic and XML-based languages used by ETL tools and reports. And that is the challenging part.

It is not easy to develop sufficient support for one language, let alone the dozens you need in most cases to cover the basics of your environment. Another challenge is dynamic code, where expressions are built on the fly based on program inputs, data in tables, environment variables, etc. But there are ways to handle such situations. On the other hand, this approach is the most accurate and complete, as every single piece of logic is processed. It also guarantees the most detailed lineage of all.
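As a deliberately oversimplified illustration of the decoded approach, here is a toy sketch that pulls source and target tables out of a single INSERT ... SELECT statement. Real decoded-lineage tools rely on full parsers and semantic analysis for each language and dialect, not regular expressions.

    # Naive extraction of table-level lineage from one hard-coded SQL statement.
    import re

    sql = """
    INSERT INTO dwh.dim_customer (email_address, created_at)
    SELECT lower(email), signup_date
    FROM crm.contacts
    JOIN crm.consents ON contacts.id = consents.contact_id
    """

    target = re.search(r"INSERT\s+INTO\s+([\w.]+)", sql, re.IGNORECASE).group(1)
    sources = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.IGNORECASE)

    for src in sources:
        print(f"{src} -> {target}")
    # crm.contacts -> dwh.dim_customer
    # crm.consents -> dwh.dim_customer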

Conclusion

These are the pros and cons of several popular data-lineage approaches. Enterprises invest a lot of money in data governance solutions with insufficient data lineage capabilities, offering tricks like data similarity, data tagging, and even self-lineage. But for some areas of data governance, these methods can be just guesswork, with a lot of issues requiring a great deal of manual labor to correct the lineage.

So, what is the lesson to be learned? It’s more about finding the right balance between these approaches than just choosing one method to cover all your needs. Use the most suitable approach or tool to do the right job at the right time.