Lukáš Hermann

MANTA x Record Level Lineage: Why we don’t have it

You may or may not have heard about record level lineage. This is a topic that our customers ask about quite frequently, so our vice president of development, Lukas Hermann, decided to write an article where he answers some of the FAQs. Continue reading to find out more about record level lineage and why we don’t have it.

What is record level lineage?

Record level lineage is an approach to data lineage that is similar to data tagging. The idea behind data tagging is that each piece of data that is being moved or transformed is tagged/labeled by a transformation engine, which then tracks that label all along its way from start to finish. This approach seems great, but it only works well when a transformation engine controls the data’s every move. Some good examples are controlled environments like Cloudera or Dremio that focus only on the origin of one specific record.

Record level lineage vs. column level lineage

A feature that MANTA does have, that in a way is similar to record level lineage, is column level lineage. What exactly is the difference? Let’s look at an example.

Let’s say you have the column full name in your table. In this table, the full name is created by combining the first name and the last name. Imagine that in the full name column you have names like John Snow and Jack Snow. Now, let’s say that the name John Snow came to this table from your own CRM database, but Jack Snow came from a contact database acquired from a third party.

Record level lineage is able to tell you exactly that John Snow came from CRM and Jack Snow came from your contact database. Column level lineage, like in MANTA, is able to tell you that the column full name consists of data from these two databases—your CRM database and your contact database.
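To make the difference concrete, here is a small Python sketch. It is only a toy and says nothing about how MANTA is implemented: the table name CUSTOMER, the source names CRM and CONTACT_DB, and both helper functions are assumptions made up for this example.

```python
# Metadata only: which source columns feed the target column.
# All table and column names here are hypothetical.
column_lineage = {
    "CUSTOMER.full_name": {
        "CRM.first_name", "CRM.last_name",
        "CONTACT_DB.first_name", "CONTACT_DB.last_name",
    },
}

# Data plus tags: what a record level engine has to carry for every row.
tagged_rows = [
    {"full_name": "John Snow", "origin": "CRM"},
    {"full_name": "Jack Snow", "origin": "CONTACT_DB"},
]


def column_sources(target):
    """Databases feeding a column, derived from metadata alone."""
    return sorted({col.split(".")[0] for col in column_lineage[target]})


def record_origin(full_name):
    """Origin of one record, which requires the tagged data itself."""
    return [row["origin"] for row in tagged_rows if row["full_name"] == full_name]


print(column_sources("CUSTOMER.full_name"))  # ['CONTACT_DB', 'CRM']
print(record_origin("Jack Snow"))            # ['CONTACT_DB']
```

The first function never touches a single name in the table, while the second one cannot answer anything without the rows and their tags.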

Why we don’t have it

The reason why MANTA does not have record level lineage is that MANTA doesn’t “see” your data; it doesn’t even “see” that you have a John Snow and a Jack Snow in your full name column. It only reads your metadata. That is why MANTA only sees a table that contains data from these databases and which databases they are.

Now, you might be thinking that the overall idea of the record level lineage approach might not be so bad after all. But keep in mind that if anything happens outside its walls, the lineage is broken. It is also important to realize that the lineage is only there if the transformation logic has been executed. But think about all the exceptions and rules that apply only once every couple of years. You will not see them in your lineage until they are executed, which is not exactly healthy for your data governance, especially if some of those pieces are critical to your organization.

Also, tags are formed by assigning additional metadata to the records. If you lose this metadata, you will never be able to form the lineage again. And without actually running the transformation engine, you don’t know how the given record was put together, and therefore don’t know the lineage behind it.
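The problem with rules that rarely fire can be shown with a short Python sketch. Everything in it is hypothetical: the `load_customer` transformation, the CORRECTIONS source, and the hand-written `STATIC_SOURCES` set, which stands in for what a metadata scanner would derive by reading the code.

```python
def load_customer(row):
    """Hypothetical transformation: return (value, source) for the target column."""
    if row.get("leap_day_correction"):  # a rule that applies once every few years
        return row["corrected_name"], "CORRECTIONS"
    return row["name"], "CRM"


# Static view: every source the logic CAN read, known without running it.
# Written by hand here; a scanner would derive it from the code above.
STATIC_SOURCES = {"CRM", "CORRECTIONS"}


def observed_sources(rows):
    """Runtime view: only the sources of branches that actually executed."""
    return {load_customer(row)[1] for row in rows}


todays_rows = [{"name": "John Snow"}, {"name": "Jack Snow"}]
missing = STATIC_SOURCES - observed_sources(todays_rows)
print(missing)  # {'CORRECTIONS'}
```

Until a row with the rare flag shows up, lineage collected at run time has no idea that CORRECTIONS can feed the column.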

In conclusion

If MANTA wanted to have record level lineage, it would have to start reading your data instead of your metadata, and it would have to have much more information about your environment. This would make the entire process of getting data lineage far more complicated and time-consuming.

We can safely say that we are not planning on having record level lineage as a feature any time soon. On the other hand, we plan on putting more effort into understanding your data transformations. The fact that MANTA only reads your metadata and is only interested in your data transformations, not your actual data, is the reason why MANTA can be automated so well and get data lineage so fast.

And what about Conditional Lineage? 

You might object that MANTA also has conditional lineage as a feature, and that we do look into the actual data when we create it. Well, not quite. We only use the data that is specifically mentioned in the scripts. You can learn more about conditional lineage in the article: How to Handle Impact Analyses in Complex DWHs with Predicates.

So what MANTA does give you is a list of the exact databases that supply data to the given column in your table. For compliance with regulations such as GDPR and other financial or banking regulations, that is completely sufficient. And typically, no more than a few databases supply each column. So the question is: if you really need the specific database for each record in your table that badly, and you can have the candidates narrowed down to a few per column in a couple of hours, wouldn’t it be more efficient to just check those few databases manually for the specific record yourself?

Do you have any development-related questions for Lukas, or would you like to learn more about how MANTA can solve a specific issue in your company? Don’t hesitate to contact us at manta@getmanta.com.

How to Solve Impact Analysis with MANTA

Our customers use MANTA for all kinds of projects, impact analysis being one of them. When you have a really complex BI environment and still want to perform a reliable impact analysis, using predicates (and MANTA) is one way! Keep on reading to learn how.

During our pilots and deployments, we often find data warehouse environments that use very general physical models including several big tables like PARTY, BALANCE, ORDER, and others.

These tables contain data obtained from various source systems, and there are a lot of data marts and reports built on top of them. These tables make things difficult during impact analysis because data lineage from almost every report goes through them and into all the sources, making the results hard to use, or even worthless.

Impact Analysis Is Easier Than Ever Before

Let’s take a closer look at an example to understand exactly what happens when you use MANTA for your impact analysis. The table PARTY contains all information about individuals and companies that are somehow related to the organization. Thus, in one table, it is possible to have records for clients, employees, suppliers, and the organization’s branch network. Each type of entity is identified by the unique attribute or source system from which the data is obtained – for example, clients are managed in a different system than employees.

Now, let’s assume that the data from the PARTY table goes into two separate reports – a report EMPL_REPORT that displays information about employees and another report BRANCH_REPORT that displays information about the branch network. If we use the standard data lineage analysis, we can get this picture:

Although only data from the EMPLOYEE source table is relevant for the report EMPL_REPORT, the impact analysis from that report also includes the CLIENT, BRANCH, and SUPPLIER source tables due to the PARTY table. The problem is the same for the report BRANCH_REPORT. From the other side, the impact analysis from the EMPLOYEE source table includes both the EMPL_REPORT and BRANCH_REPORT which is confusing.
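Here is a minimal Python sketch of that standard analysis, assuming the lineage is stored as a plain list of edges. The edge list mirrors the example above; the `reachable` function is an illustration, not MANTA's algorithm.

```python
from collections import deque

# Edges point in the direction data flows (source -> target).
EDGES = [
    ("CLIENT", "PARTY"), ("EMPLOYEE", "PARTY"),
    ("BRANCH", "PARTY"), ("SUPPLIER", "PARTY"),
    ("PARTY", "EMPL_REPORT"), ("PARTY", "BRANCH_REPORT"),
]


def reachable(start, edges, upstream):
    """Breadth-first walk through the lineage graph, ignoring any conditions."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for src, dst in edges:
            here, there = (dst, src) if upstream else (src, dst)
            if here == node and there not in seen:
                seen.add(there)
                queue.append(there)
    return sorted(seen)


print(reachable("EMPL_REPORT", EDGES, upstream=True))
# ['BRANCH', 'CLIENT', 'EMPLOYEE', 'PARTY', 'SUPPLIER']
print(reachable("EMPLOYEE", EDGES, upstream=False))
# ['BRANCH_REPORT', 'EMPL_REPORT', 'PARTY']
```

Because every path runs through PARTY, the walk cannot tell the relevant source from the irrelevant ones.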

The Advantage of Using Data Lineage

Luckily, there is a solution. When data is inserted into the PARTY table from different source systems, there is often a column like PARTY.source_system_id where the identification of the source system is stored as a constant. Similarly, when a report is created that consumes data only from specific source systems, there is a condition in the statement filtering data based on the PARTY.source_system_id column. Thus, it is possible to automatically analyze both the insertion and selection to/from the PARTY table and create predicates such as PARTY.source_system_id = 20 that are then stored together with data lineage in the metadata repository. Therefore, it is possible to include them in the computation during the impact analysis.

Thanks to that, if we perform an impact analysis from the report EMPL_REPORT, the predicate PARTY.source_system_id = 20 is gathered before the table PARTY. When the analysis continues towards source tables, the predicate for each path is selected and compared to what has already been gathered. Therefore, when the path to the source table CLIENT with the predicate PARTY.source_system_id = 10 is tested, the result is that both predicates cannot hold at once, so data for this report cannot come from this source table. Conversely, when the path to the source table EMPLOYEE with the predicate PARTY.source_system_id = 20 is tested, the result is that data for this report can come from this source table, so it is included in the results of the impact analysis. We can get similar results if we perform an impact analysis for the BRANCH_REPORT and also from sources like the EMPLOYEE table.
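The walk described above can be sketched in Python roughly like this. It is a simplified illustration under several assumptions: predicates are limited to equality on a single column, the ids 30 and 40 for BRANCH and SUPPLIER are invented (the text only gives 10 and 20), and the functions `merge` and `impact` are hypothetical names, not MANTA's API.

```python
COL = "PARTY.source_system_id"

# Each edge: (source, target, predicate).
# A predicate maps a column to the constant it must equal.
EDGES = [
    ("CLIENT", "PARTY", {COL: 10}),
    ("EMPLOYEE", "PARTY", {COL: 20}),
    ("BRANCH", "PARTY", {COL: 30}),         # assumed id
    ("SUPPLIER", "PARTY", {COL: 40}),       # assumed id
    ("PARTY", "EMPL_REPORT", {COL: 20}),
    ("PARTY", "BRANCH_REPORT", {COL: 30}),  # assumed id
]


def merge(gathered, predicate):
    """Combine predicates; return None if they cannot hold at once."""
    merged = dict(gathered)
    for column, value in predicate.items():
        if merged.setdefault(column, value) != value:
            return None
    return merged


def impact(start, upstream=True, gathered=None):
    """Depth-first walk that drops paths with contradictory predicates."""
    gathered = gathered or {}
    found = set()
    for src, dst, predicate in EDGES:
        here, there = (dst, src) if upstream else (src, dst)
        if here != start:
            continue
        merged = merge(gathered, predicate)
        if merged is not None:
            found.add(there)
            found |= impact(there, upstream, merged)
    return found


print(sorted(impact("EMPL_REPORT")))               # ['EMPLOYEE', 'PARTY']
print(sorted(impact("EMPLOYEE", upstream=False)))  # ['EMPL_REPORT', 'PARTY']
```

The path to CLIENT is rejected because 10 and 20 cannot both be the value of the same column, which is exactly the comparison described in the paragraph above.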

The results of the advanced data lineage analysis can look like this (in reality, if we perform the impact analysis from the EMPL_REPORT, we will only see the EMPLOYEE and PARTY tables):

Surely, the situation can be far more complex. For example, the data from the PARTY table can be pre-computed for more source systems first, and then several reports can be created on top of them for only a specific source system, like in this picture:

If you have any questions or comments, feel free to contact Lukas at manta@mantatools.com. You can try these predicate-based impact analyses in our free trial!

 

 

MANTA 3.21: Support for Informatica EDC, Microsoft SSAS & more!

It’s release time! We have just launched MANTA version 3.21, and there are some pretty nice improvements and two new integrations! Read more to see what’s new!

The New Informatica EDC Is Now Supported

MANTA has released its first version of the integration with Informatica Enterprise Data Catalog 10.2. MANTA can:

  1. Connect to the same databases as EDC and scan all DDL scripts stored there
  2. Automatically push in data lineage that integrates with EDC’s native resources, enriching the previous data lineage with stored procedures, pieces of SQL code that were hidden before because EDC has a hard time reading them, and other pieces of programming code
  3. Make every Informatica EDC + MANTA customer happy!

Expanding Our Microsoft SSAS Support

The next big integration that this release brings you is support for the tabular models of Microsoft SQL Server Analysis Services (SSAS) 2012 or newer. SSAS allows the customer to define the dimensions (master tables) and the facts (measurable details) in the data mart and is then able to measure and process them, helping the customer create reports. MANTA can now access the analytical level of your business intelligence environment to create complete end-to-end data lineage for your Microsoft environment.

MANTA’s developers have also spent a great deal of time on other Microsoft technologies. (SPOILER ALERT: Our next release in Q3 will support SSRS, making MANTA fully on board with all Microsoft SQL Server technologies!)

Other technological improvements include large Oracle updates, updates for IBM DB2 and IBM Netezza (now known as IBM PureData for Analytics Powered by Netezza), and some tweaks and fixes to our native visualization.

Version 3.21 also supports what we call enterprise features, which allow the assignment of individual repository access rights to certain groups of employees for accessing specific resources, databases, or even schemas.  And last but not least, automatic differences between revisions are also available in our native visualization from now on.

Is there anything else you would like to know about MANTA 3.21? Don’t hesitate to contact us at manta@getmanta.com (or use the form). We are always glad to help!

Automated, Painless Proof-of-Concept? Learn How MANTA Does It

The PoC (Proof of Concept) is a standard process intended to show potential MANTA customers that our solution works in their environments. To give you some insight into the process itself, we will walk you through it. Come right this way!

Well, here are just four things you should know:

1. It’s Automated

Automation has always been at the core of MANTA’s design. And we are very proud to say that our PoCs are easy, fast, and painless because of the amount of automation we can offer. There is one catch, though. MANTA needs to be installed (usually locally, but a remote option is also available) and connected to your environment (obviously non-production). In many organizations, that requires some effort – connections need to be approved by security officers and that might take a while. But how can you test anything without letting it – you know – do its thing?

2. It’s Free

Have you read number one? To put it simply: MANTA’s standard proof of concept is free of charge because it’s effortless on both sides, and it takes about 30 days (but most of the work is done within one week!). We know that many companies charge anything from $1000 to approximately a gazillion dollars to do a PoC. Our usual PoC is made to be quick and painless.

However, if the environment is complex, with a lot of proprietary steps and running on multiple platforms, we might not be able to run our standard PoC. Then there would be a made-to-measure PoC, with a fee depending on the amount of custom work required. It’s not a full integration, so we always agree on the scope and on clear terms and conditions with each customer at the beginning, so nobody is left in the wind.

3. It’s Properly Tracked & Documented

That’s why we have everything logged via a simple yet powerful documentation process. So how does that work?

There are usually five people who participate in the PoC process:

  • Sales and Technical Support experts from MANTA (That’s two and only two, we promise. No need to get overwhelmed.)
  • And on the customer’s side: PoC Manager for the process (let’s call him the champion, shall we?), Technical Participant who knows the environment, and Evaluation Analyst capable of evaluating results and the added value of MANTA in the overall solution implementation process (that’s three, obviously not all of them are needed at any given moment)

Set-Up

The process itself usually kicks off with a pre-installation phase, where the customer and MANTA check off a list of prerequisites prior to the installation phase. These vary based on the technologies selected for the PoC, but we do have a list of requirements ready. After that comes the installation phase, where the future administrator/manager of MANTA is trained on the application and the software is installed into the customer’s system, configured and, finally, tested!

Live Testing

Then comes an inspection of the KPIs (Key Performance Indicators) that serves to verify how the solution works. After the future MANTA users are trained on the software, they summarize how satisfied they were with the test.

The KPIs usually include:

  • the amount of automatically visualized data lineage and if it is visualized correctly
  • the amount of labor saved on a selected task, for example impact analysis, regulatory compliance, self-service or pretty much anything else (above 80% is what we aim to achieve)

We track this process in our system, MANTA Help Desk & Knowledge Base. The accounts are free of charge for any number of people necessary for the PoC, and later, for the actual use of MANTA.

4. It’s Not Demanding for the Customer Either

It’s completed in 30 days, but most of the work is done within a week or so. Yes, you can install-test-uninstall MANTA way faster (within one day or less), but these are the best practices that help our customers to achieve the most successful PoC possible (and hey, it’s still faster and cheaper than the other guys, right?).

For the technical part of the process, a physical or virtual machine with these specs is needed:

  • CPU minimum: 4 cores at 2.5 GHz
  • RAM minimum: 8 GB
  • HDD: 100 MB for MANTA + space for metadata, up to several dozen GB

There’s no need to have a dedicated machine; a virtual machine with these resources is also OK. You can use an existing machine to start the PoC as soon as possible. Check out our supported technologies list for stuff that we can parse, and let’s just say that machine-wise, anything you can run Java on should be just fine. Or just ask us at manta@getmanta.com.

The actual Proof-of-Concept Document

If you are interested, you can review our actual Proof-of-Concept Document right here:

Please, note that this is the universal, long document that will be significantly downscaled for the exact combination of technologies used in your case:

Download the PoC Document now.

 

How to Handle Impact Analyses in Complex DWHs with Predicates

“How to get full data lineage in complex BI environments and perform reliable impact analyses?” Predicates (with the help of Manta Flow!) might be the answer. 

During our pilots and deployments, we often find data warehouse environments that use very general physical models including several big tables like PARTY, BALANCE, ORDER and others. These tables contain data obtained from various source systems, and there are a lot of data marts and reports built on top of them. These tables make things difficult during the impact analysis because data lineage from almost every report goes through them to all sources making the result worthless.

Impact Analyses Do Not Have to Be THIS BIG 

Let’s take a look at an example to understand exactly what happens. The table PARTY contains all individuals and companies that are somehow related to the organization. Thus, in one table, it is possible to have records for clients, employees, suppliers and its branch network. Each type of entity is identified by a unique attribute or source system from which data is obtained – for example, clients are managed in a different system than employees.

Now, let’s assume we have two reports based on data from the PARTY table – a report EMPL_REPORT that displays information about employees and another report BRANCH_REPORT that displays information about the branch network. If we use the standard data lineage analysis, we can get this picture:

[Figure: predicates1]

Although only data from the EMPLOYEE source table is relevant for the report EMPL_REPORT, the impact analysis from that report also includes the CLIENT, BRANCH and SUPPLIER source tables due to the PARTY table. The problem is the same for the report BRANCH_REPORT. From the other side, the impact analysis from the EMPLOYEE source table includes both the EMPL_REPORT and BRANCH_REPORT, which is confusing.

In a real environment, there are dozens of source systems and hundreds of reports, which makes the standard data lineage analysis worthless.

The Advanced Data Lineage Analysis 

Fortunately, there is a solution. When data is inserted into the PARTY table from different source systems, there is often a column like PARTY.source_system_id where the identification of the source system is stored as a constant. Similarly, when a report is created that consumes data only from specific source systems, there is a condition in the statement filtering data based on the PARTY.source_system_id column. Thus, it is possible to automatically analyze both the insertion and selection to/from the PARTY table and create predicates such as PARTY.source_system_id = 20 that are then stored together with data lineage in the metadata repository. Therefore, it is possible to include them in the computation during the impact analysis.
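To give a feel for where such predicates come from, here is a toy Python sketch that pulls them out of two SQL statements. Both statements and both helper functions are hypothetical, and the regular expressions only cope with this simple shape of SQL. A real scanner would need a full SQL parser, so treat this purely as an illustration of the idea.

```python
import re

# Hypothetical load and report statements, for illustration only.
LOAD_EMPLOYEE = """
INSERT INTO PARTY (name, source_system_id)
SELECT name, 20 FROM EMPLOYEE
"""
EMPL_REPORT = """
SELECT name FROM PARTY WHERE source_system_id = 20
"""


def predicate_from_insert(sql, column):
    """Find the constant written into `column` by a simple INSERT ... SELECT."""
    match = re.search(
        r"INSERT INTO \w+ \((.*?)\)\s*SELECT (.*?) FROM", sql, re.S | re.I
    )
    if not match:
        return None
    targets = [c.strip() for c in match.group(1).split(",")]
    values = [v.strip() for v in match.group(2).split(",")]
    value = dict(zip(targets, values)).get(column)
    return int(value) if value and value.isdigit() else None


def predicate_from_where(sql, column):
    """Find a `column = constant` filter in a WHERE clause."""
    match = re.search(
        rf"WHERE .*?\b{re.escape(column)}\s*=\s*(\d+)", sql, re.S | re.I
    )
    return int(match.group(1)) if match else None


print(predicate_from_insert(LOAD_EMPLOYEE, "source_system_id"))  # 20
print(predicate_from_where(EMPL_REPORT, "source_system_id"))     # 20
```

Note that both values come straight from the scripts. No row of the PARTY table is ever read.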

Thanks to that, if we perform an impact analysis from the report EMPL_REPORT, the predicate PARTY.source_system_id = 20 is gathered before the table PARTY. When the analysis continues towards source tables, the predicate for each path is selected and compared to what has already been gathered. Therefore, when the path to the source table CLIENT with the predicate PARTY.source_system_id = 10 is tested, the result is that both predicates cannot hold at once, so data for this report cannot come from this source table. Conversely, when the path to the source table EMPLOYEE with the predicate PARTY.source_system_id = 20 is tested, the result is that data for this report can come from this source table, so it is included in the result of the impact analysis. We can get similar results if we perform an impact analysis for the BRANCH_REPORT and also from sources like the EMPLOYEE table.

The result of the advanced data lineage analysis can look like this (in reality, if we perform the impact analysis from the EMPL_REPORT, we will only see the EMPLOYEE and PARTY tables):

[Figure: predicates2]

Surely, the situation can be far more complex. For example, the data from the PARTY table can be pre-computed for more source systems first, and then several reports can be created on top of them for only a specific source system, like in this picture:

[Figure: predicates3]

This is also something that can be handled and, as you may have expected, even this is a part of the Manta Flow product analysis.
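One way to picture the more complex case is to let a predicate hold a set of allowed values and to intersect the sets along each path. The Python sketch below does that under stated assumptions: the intermediate table PARTY_PRECOMPUTED, the report CLIENT_REPORT, and the id 30 are invented for the example, and `sources_of` is a hypothetical helper, not how the product computes it.

```python
# Each edge: (source, target, allowed PARTY.source_system_id values).
# PARTY_PRECOMPUTED, CLIENT_REPORT and the id 30 are assumptions.
EDGES = [
    ("CLIENT", "PARTY", {10}),
    ("EMPLOYEE", "PARTY", {20}),
    ("BRANCH", "PARTY", {30}),
    ("PARTY", "PARTY_PRECOMPUTED", {10, 20}),   # pre-computed for two systems
    ("PARTY_PRECOMPUTED", "EMPL_REPORT", {20}),
    ("PARTY_PRECOMPUTED", "CLIENT_REPORT", {10}),
]


def sources_of(report):
    """Walk upstream, intersecting the allowed values along each path."""
    result = set()
    stack = [(report, None)]  # None means "no restriction gathered yet"
    while stack:
        node, allowed = stack.pop()
        for src, dst, values in EDGES:
            if dst != node:
                continue
            narrowed = values if allowed is None else allowed & values
            if narrowed:  # an empty set means the predicates contradict
                result.add(src)
                stack.append((src, narrowed))
    return sorted(result)


print(sources_of("EMPL_REPORT"))    # ['EMPLOYEE', 'PARTY', 'PARTY_PRECOMPUTED']
print(sources_of("CLIENT_REPORT"))  # ['CLIENT', 'PARTY', 'PARTY_PRECOMPUTED']
```

The restriction gathered at the report survives the hop through the pre-computed table, so BRANCH never shows up and each report still ends at a single source table.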

If you have any questions or comments, feel free to contact Lukas at manta@mantatools.com. You can try these predicate-based impact analyses in our free trial – just request it using the form on the right. 

 
