Mess in, Mess Out: How Low Quality Data Ruins Your Analytics

MANTA Business
September 9, 2019
Tomáš Krátký

A few days ago, I found an interesting article by Moshe Kranc, CTO at Ness Digital Engineering, published on August 1, 2019, on InformationWeek. He did an excellent job of reminding us all that with low-quality data, the only result anyone can expect from analytics is a mess (mess in, mess out).

I do not agree with his terminology when he says that “data is clean by instituting a set of capabilities and processes known collectively as data governance.” This adds fuel to the never-ending terminology war over what data governance really is and what the traditional term data management means. But it is a very minor issue.

What is more interesting is another sentence at the beginning of Moshe’s post addressing one of the benefits of data lineage: “standardization of the business vocabulary and terminology, which facilitates clear communication across business units.”

Lineage or Business Glossary—Which Goes First?

While almost every DG vendor tells you that data lineage is too technical for the initial phases of your data governance program and that you should start with more “fundamental” tasks (like building your business glossary, for example), Moshe says something different: in order to understand your business terms and how they relate to each other, data lineage is essential. It shows you your data’s real journey, and it helps you understand how the KPIs and critical data elements in your reports and dashboards are calculated and what data they use. This is necessary to eliminate misunderstandings between different teams and units and to build a high-quality business catalog.
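Conceptually, answering “how is this KPI calculated, and from what data?” is an upstream traversal of the lineage graph. A minimal sketch, assuming a hypothetical column-level lineage graph stored as simple (source, target) edges (all names here are invented):

```python
from collections import deque

# Hypothetical column-level lineage: each edge points from source to target.
EDGES = [
    ("crm.customers.revenue", "staging.sales.revenue"),
    ("erp.orders.amount",     "staging.sales.revenue"),
    ("staging.sales.revenue", "dashboard.kpi.total_revenue"),
]

def upstream(target, edges):
    """Return every column that feeds into `target`, directly or indirectly."""
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, []).append(src)
    seen, queue = set(), deque([target])
    while queue:
        for src in parents.get(queue.popleft(), []):
            if src not in seen:
                seen.add(src)
                queue.append(src)
    return seen

print(sorted(upstream("dashboard.kpi.total_revenue", EDGES)))
# → ['crm.customers.revenue', 'erp.orders.amount', 'staging.sales.revenue']
```

With the full set of upstream columns in hand, a glossary term like “total revenue” can be tied to the actual data elements behind it instead of to someone’s recollection of them.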

We very often see teams starting with a DG tool, trying to implement basic processes and build a business vocabulary, and manually analyzing lineage to decipher the relationships among different terms and the real data elements that are duplicated and distributed across so many different domains and systems.

Three Types of Lineage

The main part of the article is all about the different types of lineage. Moshe sees three of them: decoded lineage (getting metadata from the code that manipulates data), data similarity lineage (getting lineage by examining data and schemas), and manual lineage mapping. We wrote a similar article a year ago that covers one more type of lineage; check it out here. Moshe builds a very good list of pros and cons for each approach. But I feel it is important to share a slightly different opinion.
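To make “decoded lineage” concrete: it means parsing the code itself. Real scanners use full SQL parsers, but a toy sketch with a regular expression shows the idea (the SQL statement and table names are invented, and this naive pattern would fail on anything beyond a simple INSERT … SELECT):

```python
import re

def table_lineage(sql):
    """Very rough table-level lineage from a single INSERT ... SELECT statement."""
    target = re.search(r"INSERT\s+INTO\s+([\w.]+)", sql, re.I)
    sources = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.I)
    return [(src, target.group(1)) for src in sources]

sql = """
INSERT INTO dw.fact_sales
SELECT o.id, o.amount, c.region
FROM erp.orders o
JOIN crm.customers c ON c.id = o.customer_id
"""
print(table_lineage(sql))
# → [('erp.orders', 'dw.fact_sales'), ('crm.customers', 'dw.fact_sales')]
```

A production scanner must handle subqueries, views, CTEs, dynamic SQL, and dozens of dialects, which is exactly why building such scanners is hard.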

Decoded Lineage as a Holy Grail (Or Not?)

Let’s start with decoded lineage. One objection is that developing scanners for all technologies is impossible (or at least very hard). That is true. But there are different approaches, and a solid data lineage platform should support them all where possible. Besides traditional scanning, it is also possible to extend the existing code base with extra logging, or to use a specialized library that monitors the calls and transformations in your code and records useful information. It is never as detailed as true decoded lineage, but it is still good enough. (We call it execution lineage.)
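Conceptually, such an execution-lineage hook can be as simple as a logging wrapper around each transformation step. This is a hypothetical sketch (the decorator name and record format are invented, not MANTA’s API):

```python
import functools, json, time

def record_lineage(inputs, outputs, log=print):
    """Decorator that emits a lineage record each time a transformation runs.
    `inputs`/`outputs` name the datasets the step reads and writes."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            result = fn(*args, **kwargs)
            log(json.dumps({"step": fn.__name__, "inputs": inputs,
                            "outputs": outputs, "ts": time.time()}))
            return result
        return inner
    return wrap

@record_lineage(inputs=["staging.sales"], outputs=["dw.fact_sales"])
def load_fact_sales():
    pass  # the real transformation would run here

load_fact_sales()  # emits one JSON lineage record
```

The records only tell you which datasets each executed step touched, not the column-level logic inside, which is why this is coarser than true decoded lineage but still very usable.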

Another objection is that code versions change over time, so your analysis of the current code’s data flow may miss an important flow that has since been superseded. True again, but that is why every data lineage platform should support versioning, so you can see changes in your lineage over time, compare lineage versions, and do other fun things for DevOps (or DataDevOps? or DataOps? I am confused by all the new buzzwords), security, or compliance reporting.
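Comparing two lineage versions is, at its core, a set difference over the graph’s edges. A minimal sketch with invented edge data:

```python
def lineage_diff(old_edges, new_edges):
    """Compare two versions of a lineage graph, each a set of (source, target) edges."""
    old, new = set(old_edges), set(new_edges)
    return {"added": sorted(new - old), "removed": sorted(old - new)}

v1 = {("staging.sales", "dw.fact_sales")}
v2 = {("staging.sales", "dw.fact_sales"), ("staging.returns", "dw.fact_sales")}
print(lineage_diff(v1, v2))
# → {'added': [('staging.returns', 'dw.fact_sales')], 'removed': []}
```

A removed edge answers exactly the objection above: the superseded flow is not lost, it shows up in the diff against the older version.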

Dynamic code is a traditional objection that has already been resolved. More interesting are commands executed directly by a DBA (to fix something, for example). But these, too, are easily resolved with execution lineage, using specialized logs, for example (see above). Some basic processes are necessary to make it possible, but nothing dramatic.

I was a bit confused by another objection: a decoded lineage tool will faithfully capture what the code does without raising a red flag (if the code violates GDPR rules, for example). I do not understand why this is wrong. I believe it is the only right thing to do: to reveal what is really happening so it can be detected and fixed. With the manual approach, we too often see people creating so-called make-believe lineage (lineage as they think it is, instead of true lineage). And yes, the really important question is how to use lineage information to detect possible violations of company policy. That is why a data lineage platform should offer you a way to build quality rules, or a way to build them in your DQ tool and execute them against your data lineage platform.
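Such a rule can be as simple as checking whether any flow carries a sensitive column into a place it should not go. This sketch checks direct edges only (a real rule would follow transitive flows), and the column tags and schema names are hypothetical:

```python
# Hypothetical policy metadata: which columns are PII, which schemas may hold PII.
PII_COLUMNS = {"crm.customers.email", "crm.customers.ssn"}
APPROVED_TARGETS = {"secure_dw"}

def policy_violations(edges):
    """Flag lineage edges that move a PII column into a non-approved schema."""
    return [(s, t) for s, t in edges
            if s in PII_COLUMNS and t.split(".")[0] not in APPROVED_TARGETS]

edges = [
    ("crm.customers.email", "secure_dw.dim_customer.email"),   # allowed
    ("crm.customers.email", "marketing.campaign_list.email"),  # violation
]
print(policy_violations(edges))
# → [('crm.customers.email', 'marketing.campaign_list.email')]
```

The point stands: the lineage tool’s job is to report the flow truthfully; the rule engine, not the scanner, decides whether the flow is a violation.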

Another confusing objection concerns duplicates in data: two different processes work with the same data element and create a duplicate (for example, two duplicate data elements used in different reports instead of a single one used by all reports). Yes, that happens very often, but it is not a reason to stop using the decoded lineage approach. Even better, it is possible to detect duplicates by comparing processes/workflows and their associated data sources. MANTA is also able to understand the calculations in your code (see our post about transformation logic), so comparing different processes for duplication becomes even easier.
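Once lineage includes the transformation logic, duplicate detection can reduce to comparing each process’s inputs together with a normalized form of its expression. A simplified sketch with invented process records:

```python
def fingerprint(process):
    """Normalize a process into (inputs, expression) for duplicate comparison."""
    expr = "".join(process["expression"].lower().split())  # strip case + whitespace
    return (frozenset(process["inputs"]), expr)

p1 = {"inputs": ["erp.orders.amount"], "expression": "SUM(amount) * 1.21"}
p2 = {"inputs": ["erp.orders.amount"], "expression": "sum( amount )*1.21"}
print(fingerprint(p1) == fingerprint(p2))  # → True, likely duplicates
```

Real comparison would work on parsed expression trees rather than strings, but the principle is the same: same inputs plus same calculation equals a duplicate candidate.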

Data Similarity Lineage

I agree with everything in that part. A lot of vendors try to trick customers with this approach. The results can only be good in a very limited environment without too much complexity or logic. Another major issue is that it provides no details about the transformations and calculations used to move and change data, which is very limiting for all the super interesting use cases like impact and root-cause analysis, incident management, DevOps, and migration projects.

My opinion is that this type of lineage is satisfactory for basic governance use cases, but no real automation can be achieved if it is the primary approach. On the other hand, we see it as complementary to decoded lineage in scenarios where no code (in any form) is available, for example, when data is sent via email or copied manually.
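For completeness, data similarity lineage works roughly like this: compare column names and sampled values across systems and infer a probable copy. A toy sketch (the threshold and field names are invented) that also shows why this approach yields guesses rather than facts:

```python
def similarity(col_a, col_b):
    """Guess whether col_b was copied from col_a, using name match + value overlap."""
    name_match = col_a["name"].lower() == col_b["name"].lower()
    values_a, values_b = set(col_a["sample"]), set(col_b["sample"])
    overlap = len(values_a & values_b) / max(len(values_b), 1)
    return name_match and overlap > 0.8  # arbitrary cutoff, tuned per environment

src = {"name": "customer_id", "sample": [1, 2, 3, 4, 5]}
dst = {"name": "CUSTOMER_ID", "sample": [2, 3, 4, 5]}
print(similarity(src, dst))  # → True: a probable copy, but no proof
```

Note what is missing: even a correct match says nothing about how the values got there or what calculation produced them, which is exactly the limitation described above.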

Manual Lineage Mapping

Manual lineage is always make-believe lineage: how people see it, not what it really is. You cannot mix design (where manual is the right way) with reality. Lineage that differs from reality is of no help if you want to use it every day to increase your productivity and reduce risks. Another very important aspect, very often ignored by data officers, is that people do not enjoy boring, routine work like manually analyzing the environment and extracting lineage from it. Manual work is always part of the picture (to fix things that are simply not supported by the lineage platform), but it should be as small a part as possible.

One Ring to Rule Them All?

The conclusion of Moshe’s analysis is inevitable: any really good solution needs to combine all the mentioned approaches (and a few more tricks, based on our experience). I do not want to argue about the right division between decoded, similarity, and manual, but my opinion is very clear: decoded as much as possible, similarity only if there is truly no code/logic manipulating the data, and manual never, or, more precisely, only if there is no other option.

But there is one even more important insight, something I have learned over the past 20 years in data management. We created a concept, metadata, that is too complex and vague (check this excellent 2016 article: Time To Kill Metadata). Then we replaced metadata management with data governance without understanding what that really means. And in the meantime, we introduced tons of new complexity: the cloud, big data, self-service analytics, data privacy regulations, agile, etc.

Don’t get me wrong: all the concepts of unified metadata management are extremely interesting, but the problem is so broad that trying to build one unified solution is simply nonsense. Data lineage is an extremely important area connecting business, technology, and operations, independent of data governance, data quality, master data management, data security, and development, yet integrated with all of them. That is why we believe so much in an open ecosystem with a focus on powerful APIs. But more about that next time.