Tomáš Krátký

Mess in, Mess Out: How Low-Quality Data Ruins Your Analytics

September 9, 2019 by Tomas Kratky

A few days ago, I found an interesting article published by Moshe Kranc, CTO at Ness Digital Engineering, on Aug 1, 2019, at InformationWeek. He did an excellent job of reminding us all that with low-quality data the only result anyone can expect from analytics is a mess (mess in, mess out).

I do not agree with his terminology when he says that “data is clean by instituting a set of capabilities and processes known collectively as data governance.” This contributes to the never-ending terminology war between what data governance really is and what the traditional term data management means. But it is a very minor issue.

What is more interesting is another sentence at the beginning of Moshe’s post addressing one of the benefits of data lineage: “standardization of the business vocabulary and terminology, which facilitates clear communication across business units.”

Lineage or Business Glossary—Which Goes First?

While almost every DG vendor tells you that data lineage is something too technical for the initial phases of your data governance program and that you should start with more “fundamental” tasks (like building your business glossary, for example), Moshe says something different—in order to understand your business terms and how they relate to each other, data lineage is essential. It shows you your data’s real journey; it helps you understand how the different KPIs and critical data elements in your reports and dashboards are calculated and what data they use. This is necessary to eliminate misunderstandings between teams and units and to build a high-quality business catalog.

We very often see teams starting with a DG tool, trying to implement basic processes and build a business vocabulary, and manually analyzing lineage to decipher the relationships among different terms and the real data elements duplicated and distributed across so many different domains and systems.

Three Types of Lineage

The main part of the article is all about the different types of lineage. Moshe sees three of them: decoded lineage (getting metadata from the code manipulating data), data similarity lineage (getting lineage by examining data and schemas), and manual lineage mapping. We wrote a similar article a year ago covering one more type of lineage—check it out here. Moshe builds a very good list of pros and cons for each approach. But I feel it is important to share a slightly different opinion.

Decoded Lineage as a Holy Grail (Or Not?)

Let’s start with decoded lineage first. One objection is that developing scanners for all technologies is impossible (or at least very hard). That is true. But there are different approaches, and a solid data lineage platform should support them all, if possible. Besides traditional scanning, it is also possible to extend the existing code base with extra logging or to use a specialized library to monitor the calls and transformations in your code so you can capture useful lineage information at runtime. It is never as detailed as true decoded lineage, but it is still good enough. (We call it execution lineage.)
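To make the idea concrete, here is a minimal sketch of such instrumentation in Python (hypothetical names throughout; this is not MANTA’s actual API): a decorator records which datasets a job reads and writes every time it runs, producing execution-lineage events in a log.

```python
import functools
import json
import time

LINEAGE_LOG = "lineage_events.jsonl"  # hypothetical event sink

def record_lineage(inputs, outputs):
    """Log an execution-lineage event each time the wrapped
    transformation runs (a sketch, not a real library)."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            result = fn(*args, **kwargs)
            event = {
                "job": fn.__name__,
                "inputs": inputs,    # datasets the job reads
                "outputs": outputs,  # datasets the job writes
                "executed_at": time.time(),
            }
            with open(LINEAGE_LOG, "a") as f:
                f.write(json.dumps(event) + "\n")
            return result
        return inner
    return wrap

@record_lineage(inputs=["crm.customers"], outputs=["dwh.dim_customer"])
def load_dim_customer():
    ...  # the actual transformation logic would run here
```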

Another objection is that code versions change over time, so your analysis of the current code’s data flow may miss an important flow that has since been superseded. True again, but that is why every data lineage platform should support versioning so you can see changes in your lineage over time, compare lineage, and do other fun things for DevOps (or DataDevOps? or DataOps? [I am confused with so many new buzzwords]), security, or compliance reporting.
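As a toy illustration of why versioning matters (my own sketch, not any specific product’s feature), treat each lineage snapshot as a set of edges; comparing two versions is then a simple set difference:

```python
# Each snapshot of the lineage graph is a set of (source, target) edges.
v1 = {("stg.orders", "dwh.fact_orders"),
      ("stg.customers", "dwh.dim_customer")}
v2 = {("stg.orders", "dwh.fact_orders"),
      ("stg.orders_v2", "dwh.fact_orders")}

added = v2 - v1    # flows introduced since the previous scan
removed = v1 - v2  # flows that existed before but have been superseded

print("added:", added)
print("removed:", removed)
```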

Dynamic code is a traditional objection, but one that has already been resolved. More interesting are commands executed directly by a DBA (to fix something, for example). But these are also easily handled with execution lineage, using specialized logs, for example (see above). Some basic processes are needed to make this possible, but nothing dramatic.

I was a bit confused by another objection—that a decoded lineage tool will faithfully capture what the code does without raising a red flag (if the code violates GDPR rules, for example). I do not understand why this is wrong. I believe it is the only right thing to do—to reveal what is really happening so it can be detected and fixed. Typically, with the manual approach, we too often see people creating so-called make-believe lineage (lineage as they think it is instead of true lineage). And yes, the really important issue is how to use lineage information to detect possible violations of company policy. That is why a data lineage platform should offer you a way to build quality rules, or let you build them in your DQ tool and execute them against your data lineage platform.
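For example, such a rule could walk the lineage graph downstream from known PII columns and flag any element that receives PII without being approved to hold it. Here is a minimal sketch, assuming the lineage is available as a simple adjacency map (hypothetical names throughout):

```python
from collections import deque

# Hypothetical extract of a lineage graph: column -> downstream columns.
lineage = {
    "crm.customers.email": ["dwh.dim_customer.email"],
    "dwh.dim_customer.email": ["mart.marketing.contact_list"],
}

PII_SOURCES = {"crm.customers.email"}
APPROVED = {"dwh.dim_customer.email"}  # targets allowed to hold PII

def pii_violations(graph, sources, approved):
    """Walk downstream from PII sources and report every element
    that receives PII without being approved for it."""
    violations, seen = [], set()
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for target in graph.get(node, []):
            if target in seen:
                continue
            seen.add(target)
            if target not in approved:
                violations.append((node, target))
            queue.append(target)
    return violations

print(pii_violations(lineage, PII_SOURCES, APPROVED))
# -> [('dwh.dim_customer.email', 'mart.marketing.contact_list')]
```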

Another confusing part is about duplicates in data. Two different processes work with the same data element and create a duplicate (for example, two duplicate data elements used in different reports instead of only one used by all reports). Yes, that happens very often, but it is not a reason to abandon the decoded lineage approach. Even better—it is possible to detect duplicates by comparing processes/workflows and their associated data sources. MANTA is also able to understand the calculations in your code (see our post about transformation logic), so comparing different processes for duplication becomes even easier.
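One way to mechanize that comparison (a sketch under the assumption that each process can be reduced to its input datasets plus a normalized transformation expression) is to group processes by that signature and flag any group with more than one member:

```python
from collections import defaultdict

# Hypothetical processes extracted from code: inputs + normalized logic.
processes = {
    "load_report_a": (frozenset({"dwh.sales"}), "sum(amount) group by region"),
    "load_report_b": (frozenset({"dwh.sales"}), "sum(amount) group by region"),
    "load_report_c": (frozenset({"dwh.sales"}), "avg(amount) group by region"),
}

by_signature = defaultdict(list)
for name, signature in processes.items():
    by_signature[signature].append(name)

duplicates = [names for names in by_signature.values() if len(names) > 1]
print(duplicates)  # [['load_report_a', 'load_report_b']]
```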

Data Similarity Lineage

I agree with everything in that part. A lot of vendors try to trick customers by using this approach. The results can only be good in a very limited environment without too much complexity or logic. Another major issue is that there are no details about the transformations and calculations used to move and change data, which is very limiting for all super interesting use cases like impact and root-cause analyses, incident management, DevOps, and migration projects.

My opinion is that this type of lineage is satisfactory for basic governance use cases, but no real automation can be achieved if this is the major approach. On the other hand, we see this approach as complementary to decoded lineage for scenarios where no code (in any form) is available—for example, when sending data via email or copying data manually.

Manual Lineage Mapping

Manual lineage is always make-believe lineage—how people see it, not what it really is. You cannot mix design (where the manual approach is the right way) with reality. Lineage that differs from reality is of no help if you want to use it every day to increase your productivity and reduce risks. Another very important aspect, very often ignored by data officers, is that people do not enjoy boring, routine work like manually analyzing the environment to extract lineage. Manual work is always part of the picture (to fix things that are simply not supported by the lineage platform), but it should be as tiny a part as possible.

One Ring to Rule Them All?

The result of Moshe’s analysis is inevitable—any really good solution needs to combine all the mentioned approaches (and a few more tricks, based on our experience). I do not want to argue about the healthy division between decoded, similarity, and manual, but my opinion is very clear: decoded as much as possible, similarity only if there is really no code/logic manipulating data, and manual never—or, more precisely, only if there is no other option.

But there is one even more important insight, something I have learned over the past 20 years in data management. We created a concept of metadata that is too complex and vague (check this excellent 2016 article: Time To Kill Metadata). Then we replaced metadata management with data governance without understanding what it really means. And in the meantime, we introduced tons of new complexity—the cloud, big data, self-service, analytics, data privacy regulations, agile, etc.

Don’t get me wrong—all the concepts of unified metadata management are extremely interesting, but the whole problem is so broad that trying to build one unified solution is simply nonsense. Data lineage is an extremely important area connecting business, technology, and operations, independent of data governance, data quality, master data management, data security, and development, yet integrated with all of them. That is why we believe so much in the open ecosystem with a focus on powerful APIs. But more about that next time.

MANTA 2019: We wish health to all of you and your data as well!

December 31, 2018 by Tomas Kratky

Well, that’s a wrap! 2018 is behind us, so as you sit in your office chairs in the first few work days of 2019, read our last blog post about 2018 written by our CEO, Tomas Kratky.

For most people around the world, especially those in areas with Christian roots, the time around Christmas and New Year is somehow special. It starts with Christmas parties. There are so many of them everywhere that you can literally spend the whole first half of December going from one to the next. Later in December, no matter how busy our year is, we usually tend to slow down a bit and think about spending more time with family and friends. Even workaholics consider taking some time off. This magic is what I love so much about Christmastime.

The end of the year is also a good time to look back and thank the people who have supported us on our journeys. At Manta, we would like to thank, from the bottom of our hearts, all our customers, partners, and supporters. 2018 was an amazing year – we experienced triple-digit growth in customers, employees, and revenue. We also completed our first investment round last summer and welcomed two VCs on board. With their support, we are continuously improving our operations and internal processes to prepare ourselves for the coming years of growth.

I could fill pages with all the great things we achieved in 2018, but what would be the point? The most important thing is to say how proud we are to serve such great companies, and even better, to see and enjoy the excitement that individual users and data professionals experience thanks to the lineage automation capabilities Manta brings them. In those moments, we feel like all the hard work and sleepless nights are worth it.

The end of the year is also a time flooded with different Top X predictions. Sometimes it is so boring to read again and again how technology X will disrupt the whole industry next year. This year, we are all talking about AI and machine learning. It is funny to see every technology out there suddenly “powered by AI & ML”, “bringing AI & ML to everyday life”, etc. But you know what – things are really changing, and the AI & ML buzzwords are getting more and more real. Which is great. But as with big data a couple of years back, we have to govern new initiatives properly and make them part of our data governance programs from day 1; otherwise we will end up with a big mess (as we did with data swamps). With no trust in the data you use for AI and ML, how can you trust the outcomes and results of your intelligent algorithms? Remember – mess in, mess out!

The end of the year is also the time for New Year’s resolutions and wishes. At Manta, we have no special resolution, just to keep on doing what we have done every year since we started (in line with our mission to provide our customers with fully automated and complete navigation through their entire data and application landscape) – to push the boundaries of lineage automation and understand a bit more every time, always maintain a can-do attitude, and continue to become slightly better versions of ourselves.

And my wish for 2019? As my grandma always says: “I wish you good health, my boy, and you will take care of the rest!” And she is right. Health is critical, not only for people but for data too. With bad data, we very much limit our ability to use it to its full potential. But just as we sometimes harm ourselves (drinking, smoking, no sleep, no fitness, etc.), we can also do a lot of harm to our data. So my wish for 2019 is the best possible health to all of you and your data as well!

Happy New Year! Šťastný Nový Rok! Frohes Neues Jahr! Feliz Año Nuevo!

Different Approaches To Data Lineage

September 10, 2018 by Tomas Kratky

I feel it is important to talk about the different approaches to data lineage used by data governance vendors today. When you talk about metadata, you very often think about simple things – tables, columns, reports. But data lineage is more about logic.

It is more about programming code in any form. It can be an SQL script, a PL/SQL stored procedure, a Java program, or complex macros in your Excel sheet. It can literally be anything that allows you to somehow move your data from one place to another, transform it, or modify it. So, what are your options, and how can you understand that logic?

Option 1) Ignore it! (aka Data Similarity Lineage)

No, I am not crazy! There are products building lineage information without actually touching your code. They read metadata about tables, columns, reports, etc. They profile data in your tables too. And then they use all that information to create lineage based on similarities.

Tables or columns with similar names, or columns with very similar data values – those are examples of such similarities. And if you find a lot of them between two columns, you link the columns together in your data lineage diagram. And to make it even cooler, vendors usually call it AI (another buzzword I hate very much). There is one great thing about this approach – if you watch only the data and not the algorithms, you do not care about technologies, and it is no big deal whether the customer uses Teradata, Oracle, or MongoDB with Java on top of it.
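For illustration, a toy version of such a similarity heuristic (my own sketch, not how any particular vendor implements it) could combine column-name similarity with the overlap of sampled values:

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Rough string similarity between two column names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def value_overlap(sample_a, sample_b):
    """Jaccard overlap of sampled column values."""
    a, b = set(sample_a), set(sample_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def guess_link(col_a, col_b, sample_a, sample_b, threshold=0.7):
    """Propose a lineage edge if names and data look similar enough."""
    score = 0.5 * name_similarity(col_a, col_b) \
          + 0.5 * value_overlap(sample_a, sample_b)
    return score >= threshold, score

print(guess_link("CUST_EMAIL", "customer_email",
                 ["a@x.com", "b@y.com"], ["a@x.com", "c@z.com"]))
```

Notice there is no guarantee the guess is right – it is a statistical hint, nothing more.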

But on the other hand, this approach is not very accurate, the performance impact can be significant (you work with data), and data privacy is at risk (you work with data). A lot of details are also missing (like transformation logic, for example, which is very often requested by customers), and the lineage is limited to the database world, ignoring the application part of your environment.

Option 2) Do the “business” lineage manually

This approach usually starts from the top, by mapping and documenting the knowledge in people’s heads. Talking to application owners, data stewards, and data integration specialists should give you fair but often contradictory information about data movements in your organization. And if you miss asking someone you simply don’t know about, a piece of the flow is missing! This often results in the dangerous situation where you have lineage but are unable to use it for real-case scenarios – not only do you not trust your data, you do not trust the lineage either.

Option 3) Do the technical lineage manually

I will get straight to the point here – trying to analyze technical flows manually is simply destined to fail. With the volume of code you have, its complexity, and the rate of change, there is no way to keep up with it. When you start considering the complexity of the code, and especially the need to reverse engineer the existing code, the effort becomes extremely time consuming, and sooner or later such manually managed lineage falls out of sync with the actual data transfers within the environment. You end up with the feeling of having lineage that you cannot actually trust.

Now that we know that automation is the key, let’s take a look at less laborious and error-prone approaches.

Option 4) Trace it! (aka Data Tagging Lineage)

Do you know the story of Theseus and the Minotaur? The Minotaur lives in a labyrinth, and so does Ariadne, who is in charge of the labyrinth. Ariadne gave Theseus a ball of thread to help him navigate the labyrinth by tracing his path back. And this approach is a little bit similar.

The whole idea is that each piece of data that is moved or transformed is tagged/labeled by a transformation engine, which then tracks that label all the way from start to end, just like Theseus’s thread. This approach looks great, but it works well only as long as a transformation engine controls every movement of data. A good example is a controlled environment like Cloudera.
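A toy illustration of the idea (assuming an engine that controls every step): each record carries its tags along as it passes through the pipeline.

```python
def tag_step(records, step_name):
    """Append the current step to each record's lineage tag,
    the way a controlling transformation engine might."""
    for rec in records:
        rec.setdefault("_lineage", []).append(step_name)
    return records

records = [{"order_id": 1, "amount": 100.0}]
records = tag_step(records, "extract:orders_db")
records = tag_step(records, "transform:currency_normalize")
records = tag_step(records, "load:dwh.fact_orders")

print(records[0]["_lineage"])
# ['extract:orders_db', 'transform:currency_normalize', 'load:dwh.fact_orders']
```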

But if anything happens outside its walls, the lineage is broken. It is also important to realize that lineage is captured only when the transformation logic is executed. Think about all the exceptions or rules that apply only once every couple of years. You will not see them in your lineage until they are executed, which is not exactly healthy for your data governance, especially if some of those pieces are critical to your organization.

Option 5) Control it! (aka Self Lineage)

The whole idea here is that you have an all-in-one environment that gives you everything you need – you can define your logic there, track lineage, manage master data and metadata easily, etc. There are several tools like that, especially with the new big data / data lake hype. If you have a software product of this kind, everything happens under its control – every data movement, every change of data. And so it is easy for such a tool to track the lineage.

But there is the very same issue as in the previous case with data tagging. Everything that happens outside the controlled environment is invisible. This matters especially for long-term manageability: over time, as new needs appear and new tools are acquired to address them, gaps in the lineage start to appear.

Option 6) Decode it! (aka Decoded Lineage)

OK, so now we know that logic can be ignored, traced thanks to tags, or controlled. But all those approaches fall short in most real-life scenarios. Why? Simply because the world is complex, heterogeneous, wild, and, most importantly, constantly evolving.

But there is still one other way – to read all the logic, understand it, and reverse engineer it. That literally means understanding every programming language used in your organization for data transformations and movements. And by programming language I mean really everything, including the graphic or XML-based languages used by ETL tools and reports. And that is the challenging part. It is not easy to develop sufficient support for one language, and in most cases you need tens of them to cover the basics of your environment.
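To give a feel for what “decoding” means, here is a deliberately tiny sketch that extracts table-level lineage from simple INSERT INTO ... SELECT statements with a regular expression. A real scanner needs a full, grammar-aware parser for every language and dialect, which is exactly why this is hard:

```python
import re

STMT = re.compile(
    r"insert\s+into\s+(?P<target>[\w.]+).*?"
    r"\bfrom\s+(?P<source>[\w.]+)"
    r"(?P<joins>(?:.*?\bjoin\s+[\w.]+)*)",
    re.IGNORECASE | re.DOTALL,
)
JOIN = re.compile(r"\bjoin\s+([\w.]+)", re.IGNORECASE)

def table_lineage(sql):
    """Extract (sources, target) from a simple INSERT ... SELECT.
    A toy: real code needs a proper parser per SQL dialect."""
    m = STMT.search(sql)
    if not m:
        return None
    sources = [m.group("source")] + JOIN.findall(m.group("joins") or "")
    return sources, m.group("target")

sql = """INSERT INTO dwh.fact_orders
         SELECT o.id, c.region FROM stg.orders o
         JOIN stg.customers c ON o.cust_id = c.id"""
print(table_lineage(sql))
# (['stg.orders', 'stg.customers'], 'dwh.fact_orders')
```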

Another challenging issue is dynamic code, where expressions are built on the fly from program inputs, data in tables, environment variables, etc. But there are ways to handle such situations. On the other hand, this approach is the most accurate and complete, as every single piece of logic is processed. It also guarantees the most detailed lineage of all.

An earlier version of this article was published on Tomas Kratky’s LinkedIn Pulse.

Is Guessing Good Enough for Your GDPR Project?

June 28, 2018 by Tomas Kratky

I will tell you one thing — I am tired of the GDPR buzz. Don’t get me wrong, I value privacy and data protection very much, but I hate the way almost every vendor uses it to sell their goods and services, so much so that the original idea is almost lost.

It is similar to BCBS and other past data-oriented regulations. Consulting companies, legal firms, data governance/security/metadata vendors — we are all the same: buy our “thing” and you will be OK, or at least safer with us! Every second book out there tells us that every change is an opportunity to improve, to evolve. So, what is the improvement here with GDPR? If I look around, I see a lot of legal work being done, adding tons of words (in very small letters, as always) to already long Terms & Conditions. And you know what? I don’t think there is any real improvement in it.

But things are not always so bad. There is also a lot of good stuff going on with one important goal — to better understand and govern data and its lifecycle in a company. And there is one challenging but critical part I want to discuss today — data lineage, meaning how data moves around your organization. You must understand that a customer’s email address or credit card number is not just in your CRM; it is spread all over your company across tens or even hundreds of systems — your ERP, data warehouse, reporting, the new data lake with analytics, the customer portal, numerous Excel sheets, and even external systems. The path of the data you collect can be very complex, and if you think about all the possible ways you can move and transform data in your company, one thing should be clear — your data lineage has to be automated as much as possible.

Different Approaches to Data Lineage

That being said, I feel it is important to talk about different approaches to data lineage that are used by data governance vendors today. Because when you talk about metadata, you very often think about simple things — tables, columns, reports. But data lineage is more about logic — programming code in any form. It can be an SQL script, PL/SQL stored procedure, Java program or complex macro in your Excel sheet. It can literally be anything that somehow moves your data from one place to another, transforms it, modifies it. So, what are your options for understanding that logic?

This article is based on a presentation by Jan Ulrych at the DGIQ 2018 Conference.

Option 1: Ignore it! (aka data similarity lineage)

No, I am not crazy! There are products building lineage information without actually touching your code. They read metadata about tables, columns, reports, etc. They profile data in your tables too. And then they use all that information to create lineage based on similarities. Tables, columns with similar names and columns with very similar data values are examples of such similarities. And if you find a lot of them between two columns, you link them together in your data lineage diagram. And to make it even more cool, vendors usually call it AI (another buzzword I really hate). There is one great thing about this approach — if you watch data only, and not algorithms, you do not worry about technologies and it is no big deal if the customer uses Teradata, Oracle or MongoDB with Java on top of it. But on the other hand, this approach is not very accurate, performance impact can be significant (you work with data) and data privacy is at risk (you work with data). There are also a lot of details missing (like transformation logic for example, which is very often requested by customers) and lineage is limited to the database world, ignoring the application part of your environment.

Option 2: Do the “business” lineage manually

This approach usually starts from the top by mapping and documenting the knowledge in people’s heads. Talking to application owners, data stewards and data integration specialists should give you fair but often contradictory information about the movement of data in your organization. And if you miss talking to someone you simply don’t know about, a piece of the flow is missing! This often results in the dangerous situation where you have lineage but are unable to use it for real case scenarios — not only can you not trust your data, you cannot trust the lineage either.

Jan Ulrych presenting at DGIQ 2018.

Option 3: Do the technical lineage manually

I will get straight to the point here — trying to analyze technical flows manually is simply destined to fail. With the volume of code you have, the complexity of it and the rate of change, there’s no way to keep up with it. When you start considering the complexity of the code and especially the need to reverse engineer the existing code, this becomes extremely time consuming and sooner or later such manually managed lineage will fall out of sync with the actual data transfers within the environment and you will end up with the feeling that you have lineage you cannot actually trust.
Now that we know that automation is key, let’s take a look at some less labor-intensive and error-prone approaches.

Option 4: Trace it! (aka data tagging lineage)

Do you know the story of Theseus and the Minotaur? The Minotaur lives in a labyrinth and so does Ariadne who is in charge of the labyrinth. Ariadne gives Theseus a ball of thread to help him navigate the labyrinth by being able to retrace his path.

And this approach is a bit similar. The whole idea is that each piece of data that is being moved or transformed is tagged/labeled by a transformation engine which then tracks that label the whole way from start to finish, just like Theseus’s thread. This approach looks great, but it only works well as long as the transformation engine controls the data’s every movement. A good example is a controlled environment like Cloudera. If anything happens outside its walls, the lineage is broken. It is also important to realize that the lineage is only there if the transformation logic is executed. But think about all the exceptions and rules that apply only once every couple of years. You will not see them in your lineage till they are executed. That is not exactly healthy for your data governance, especially if some of those pieces are critical to your organization.

Option 5: Control it! (aka self-lineage)

The whole idea here is that you have an all-in-one environment that gives you everything you need — you can define your logic there, track lineage, manage master data and metadata easily, etc. There are several tools like this, especially with the new big data / data lake hype. If you have a software product of this kind, everything happens under its control — every data movement, every change in data. And so, it is easy for such a tool to track lineage. But here you have the very same issue as in the previous case with data tagging. Everything that happens outside the controlled environment is invisible, especially when you consider long-term manageability. Over time, as new needs appear and new tools are acquired to address them, gaps in the lineage start to appear.

Option 6: Decode it! (aka decoded lineage)

Ok, so now we know that logic can be ignored, traced with tags and controlled. But all those approaches fall short in most real-life scenarios. Why? Simply because the world is complex, heterogeneous, wild and most importantly — it is constantly evolving. But there is still another way — to read all the logic, to understand it and to reverse engineer it. That literally means to understand every programming language used in your organization for data transformations and movements. And by programming language I mean really everything, including graphic and XML based languages used by ETL tools and reports. And that is the challenging part. It is not easy to develop sufficient support for one language, let alone the tens of them you need in most cases to cover the basics of your environment. Another challenging issue is when the code is dynamic, which means that you build your expressions on the fly based on program inputs, data in tables, environmental variables, etc. But there are ways to handle such situations. On the other hand, this approach is the most accurate and complete as every single piece of logic is processed. It also guarantees the most detailed lineage of all.

And that’s it. This was not meant to be a scientific article, but I wanted to show you the pros and cons of several popular data lineage approaches. Which leads me back to my first GDPR paragraph. I see enterprises investing a lot of money in data governance solutions with insufficient data lineage capabilities, offering tricks like data similarity, data tagging and even self-lineage. But that is just guesswork, nothing more. Guesswork with a lot of issues and manual labor to correct the lineage.

So, I am asking you once again — is guessing good enough for your GDPR project?

This article was also published on Tomas Kratky’s Linkedin Pulse.

The Year of MANTA and Why We’ve Published Our Pricing Online

December 31, 2017 by Tomas Kratky

We’ve seen a massive surge in the world of data lineage over the last year.  More buzz, more leads, more customers for us and (from what I’ve heard) for other metadata players as well. It might come as a bit of a disruption, but we’ve decided to do something which is very common in other industries, but not in ours. We’ve published our pricing online. Why?

The Year We’ve Come Through

2017 is coming to an end, so it is the right time to take a look back. It was a very hot year for metadata and data governance, partially thanks to the new GDPR regulation, but there are more reasons behind it – more and more enterprises have come to understand that the only way to build an efficient data-driven company is through proper data governance. In 2016, data itself got a lot of attention – how big it is or can potentially be, and how to manage its volume, velocity, and variety.

In 2017, we all started to realize that it is not just about data but also a lot about data algorithms – the way your data is gathered, transferred, merged, processed, and moved around your company. Thanks to GDPR, internal discussions were initiated about how and where sensitive/protected data elements are used, and suddenly it turned out that we are flooded not just with data but with data algorithms too, and it is impossible to handle it all manually, without automation.

That has drawn even more attention to MANTA and its unique data lineage automation capabilities. Our website basically exploded – our audience doubled and the use of our live demo nearly tripled. We onboarded several amazing new customers from all around the world, and we delivered four major releases this year, each packed with new features, including Public & Service APIs and support for new technologies (SSIS, IBM Netezza, IBM DB2, and Impala, to name a few). Simply said, 2017 was a fantastic year, and more is coming in 2018!

And even though this year was yet another giant step for MANTA, we decided to do one more thing to shake things up. We’ve done something that’s pretty common in other industries but not in ours.

Yes, we’ve published our pricing online for everyone to see.

And why?

MANTA is taking the lead in transparency and openness

Sometimes there are good reasons for hiding the price of your product or service. And it is common practice in the enterprise software industry. But does that really make sense? Let’s take a look at the usual reasons, then:

1) You might be legally bound by a government or its suppliers to hide the price. Yes, national security is a serious issue, and there might be some limitations put on companies that deal with it. But that applies only to individual deals and is hardly a reason to hide your prices in general.

2) You want to participate in tenders with secret bids. Yes, that also makes sense – especially when you are dealing with clients who focus only on the price. You would not want to lose just because your bid is a few thousand higher, would you? Perhaps not, but this is not our case – MANTA is a very unique software product with clear, easy-to-see value for its users. The price has to be reasonable, but it is rarely the way to win anyone’s business.

3) You want to keep everybody in the dark. Yes, some do want that. But frankly, it’s a rather dishonest strategy. It’s foolish to expect that customers do not know other players on the market and their prices. It’s even more foolish to try to control the market by spreading rumors and making deals in the shadows.

When you are confident of your product and what it stands for, you are also confident of its price. There’s no reason to follow the “industry standard” of not disclosing enterprise IT product prices. So dive into our pricing right here, and if there’s something that needs clarification, just take a look at the pricing glossary right below it.

Thank you for your support this year and see you in 2018!

Yours,

Tomas
