How to Evaluate Lineage Next to a Data Catalog

Mallory McGourley, Dec 27, 2022

Before you buy, learn what questions to ask your data team to determine if a data catalog or a data lineage platform is right for you.

The Case for Data Lineage

A study by the McKinsey Global Institute found that data-driven organizations are 23 times more likely to acquire customers than their competitors. This is not surprising: data has allowed organizations to retain more customers, create better business plans, and become 19 times more likely to be profitable than less data-driven organizations.

But becoming a data-driven organization is easier said than done. Many organizations struggle to build trust in their data, produce accurate reports, and understand their own data architecture. This is where lineage can help.

True data lineage platforms, unlike data catalogs, focus on metadata management. This is the process of scanning, structuring, and visualizing your organization’s data inventories and sources. It allows you to:

Boost your data governance by achieving full compliance and improving data quality
Amplify value by eliminating the risk associated with architecture changes
Speed up data migrations during mergers, acquisitions, or transitions into the cloud

Although data catalogs are extremely useful, most cannot provide the detailed visualizations that true data lineage generates. However, many catalogs are very good at hiding these deficiencies in their lineage capabilities. That’s why we’ve created this guide: to help you evaluate a data lineage platform against a data catalog solution.

Understanding Your Objectives

The first step to determining if your company should invest in a data lineage platform or a data catalog is reviewing your objectives. Why is your organization investing in a new solution? Do you have a specific goal to meet? Is there a deadline approaching?

To help you understand these goals, we’ve created a questionnaire. Review these questions with your data team and have a conversation about your data lineage requirements before you book a demo.

Migrations, Mergers, and Acquisitions

Is your organization still using legacy infrastructure, like an on-premises data warehouse or mainframe? How many years do you plan to continue to use this technology?

Who is in charge of documenting legacy systems for migration? What is the process for that documentation?

Is your organization planning to conduct a merger or acquisition in the next five years? What is your plan for inheriting and integrating infrastructure?

If you are planning to migrate an application, can you be sure downstream applications won’t be affected?

Data Governance

When data quality or debugging tickets are created, how are pipelines inspected and identified? Is the research manual or is there current documentation available at all times?

Are all calculations and processing instructions readily available or is a developer needed to find and translate them?

Is your data fully compliant? What change control documentation do you have?

Have you failed a regulatory audit in the past five years?

DataOps

How much time is your team spending conducting manual impact and root-cause analysis?

Can you accurately predict how a planned change will influence other parts of your data environment?

Have you been struggling to stay on budget or meet project deadlines?

How much developer turnover have you experienced, and what knowledge are departing developers taking with them?

Arm Yourself with Discovery Questions

As we stated previously, data catalog providers are very good at hiding their weaknesses. Even seasoned IT professionals have trouble catching their misdirections. Catalogs have a lot of unique and important capabilities with which to distract you.

The key is to go into the conversation armed with knowledge. You already know what you expect from this solution—what your goals and objectives are. Now, you need to ensure that this solution can help you get there in the most cost-efficient way possible.

We’ve created a list of questions you should ask during a data catalog demo to help you discover this information.

What technologies do you have lineage for?

For data catalogs, a data connector does not ensure lineage support. Some vendors bundle catalog and lineage connectors, while others sell them as two separate components. Because data catalogs do not focus on data lineage alone, they often provide fewer out-of-the-box scanners than data lineage platforms. When evaluating environment support, it is critical to differentiate between a catalog connector and a data lineage connector so that you fully understand what support you are getting.

Where are the gaps in my data pipelines?

Once you know which technologies are supported, it is beneficial to look at how these technologies interact with each other and where gaps will be in your data pipelines.

For instance, if you have an application built on MS SQL getting fed into your Snowflake data warehouse by DataStage, there will be no connection between MS SQL and Snowflake if DataStage is not supported. With that break, how will you know which downstream Snowflake assets will be affected by a change in MS SQL? How much time and effort will be spent finding out? The disruptions caused during that time (business questions going unanswered, other groups called in to help, etc.) can cause a loss in profit and productivity.
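To make the gap concrete, here is a minimal Python sketch of the MS SQL, DataStage, and Snowflake scenario. The asset names, the edge list, and the `downstream` helper are all hypothetical and exist only for illustration; a real lineage tool stores far richer metadata. The idea is simply that an edge only exists in the lineage graph if the technology that moves the data is scanned.

```python
from collections import deque

# Hypothetical lineage edges: (source asset, target asset, technology moving the data)
EDGES = [
    ("mssql.orders", "snowflake.stg_orders", "DataStage"),
    ("snowflake.stg_orders", "snowflake.fact_sales", "Snowflake SQL"),
    ("snowflake.fact_sales", "snowflake.revenue_report", "Snowflake SQL"),
]

def downstream(start, edges, supported):
    """Return every asset reachable from `start`, using only edges whose technology is scanned."""
    graph = {}
    for src, dst, tech in edges:
        if tech in supported:
            graph.setdefault(src, []).append(dst)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# With DataStage scanned, a change in MS SQL can be traced into Snowflake.
full = downstream("mssql.orders", EDGES, {"DataStage", "Snowflake SQL"})
# Without DataStage, the trace stops at the source and nothing appears affected.
partial = downstream("mssql.orders", EDGES, {"Snowflake SQL"})

print(sorted(full))
print(sorted(partial))
```

In this sketch the unsupported tool does not produce an error. It produces an empty impact list, which is easy to mistake for "nothing is affected."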

What is the level of effort to build lineage for this scanner? Walk me through how each technology is set up.

Sometimes, data catalogs claim that they have scanners for certain technologies. But many of those scanners are not truly automatic and require a lot of manual intervention. Make sure you understand how each technology is set up, how much effort your team will need to invest to get the lineage up and running, what the impact on your production systems will be, and what the overall business impact of these changes is.

What level of detail is provided automatically? Column-level? Transformation logic?

Typically, lineage is generated at a table or schema level. If it is generated at a column level, the transformation logic (stored procedures, ETL and reporting logic, etc.) is commonly left out.

A common pitfall with surface-level lineage is that it associates everything with everything. If one column in a table containing 100 columns flows into a different application than the other 99, table lineage will show those relationships as the same. It will also show all the downstream relationships from there as the same, building a spiderweb of clutter to research when solving for data quality, incident resolution, or tracking dependencies.
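The difference can be sketched in a few lines of Python. The column names, edges, and helper functions below are hypothetical, and the table-level view is simulated by collapsing column edges down to their tables, which is an assumption about how a surface-level tool behaves.

```python
# Hypothetical column-level edges: (source "table.column", target "table.column")
COLUMN_EDGES = [
    ("customers.email", "marketing.contacts.email"),
    ("customers.name", "billing.invoices.customer_name"),
    ("customers.address", "billing.invoices.ship_to"),
]

def table_of(column):
    """Strip the trailing column name, leaving the table identifier."""
    return column.rsplit(".", 1)[0]

def column_level_impact(column, edges):
    """Tables fed by this specific column."""
    return {table_of(dst) for src, dst in edges if src == column}

def table_level_impact(column, edges):
    """Tables fed by ANY column of the same source table (what table lineage shows)."""
    source_table = table_of(column)
    return {table_of(dst) for src, dst in edges if table_of(src) == source_table}

precise = column_level_impact("customers.email", COLUMN_EDGES)
coarse = table_level_impact("customers.email", COLUMN_EDGES)

print(precise)
print(coarse)
```

Here a change to `customers.email` really touches one downstream table, but the table-level view also flags `billing.invoices`. Multiply that by every hop downstream and the result is the spiderweb described above.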

Is the lineage design time or run time?

Many tools on the market look only at runtime lineage and not design time. This means you can only see the most recently run path, not all of the paths. A crucial process that runs only quarterly or bi-annually may not be captured at all.

Without the full picture that design-time lineage provides, those critical assets can easily be missed when conducting impact analysis, integrating systems, or building a new data platform off legacy application logic, leading to massive business disruptions.

True data lineage should include design-time lineage, which allows you to view all data paths, calculations, and interactions.
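A rough Python sketch of the distinction follows. The job names, dates, and the 30-day observation window are assumptions made up for this example. Design-time lineage is modeled as every path defined in the job definitions, and run-time lineage as only the paths seen executing inside the window.

```python
from datetime import date, timedelta

# Hypothetical job definitions (design time): job name -> (source, target)
JOBS = {
    "daily_sales_load": ("crm.orders", "dw.fact_sales"),
    "quarterly_tax_rollup": ("dw.fact_sales", "finance.tax_report"),
}

# Hypothetical execution log (run time): (job name, date it ran)
RUNS = [
    ("daily_sales_load", date(2022, 12, 26)),
    ("quarterly_tax_rollup", date(2022, 10, 1)),
]

def design_time_edges(jobs):
    """Every path that is defined, whether or not it has run recently."""
    return set(jobs.values())

def run_time_edges(jobs, runs, today, window_days):
    """Only the paths observed executing within the observation window."""
    cutoff = today - timedelta(days=window_days)
    return {jobs[job] for job, ran_on in runs if ran_on >= cutoff}

today = date(2022, 12, 27)
missed = design_time_edges(JOBS) - run_time_edges(JOBS, RUNS, today, window_days=30)
print(missed)
```

The quarterly rollup last ran outside the window, so the path into `finance.tax_report` is invisible to the run-time view even though it is fully defined.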

How can I get the most out of my lineage?

Typically, catalogs check the box on lineage support and focus on their strengths, but getting lineage collected and visualized is just the start. The next step is focusing on how you can extract the most value out of what you’ve collected.

How are complex processes visualized? Is a stored procedure one small icon or is it broken out by steps in the process with access to the important underlying code?

What happens if an asset isn’t scanned? Can the platform detect that missing asset and make a guess to point you in the right direction?
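As a simplified illustration of what detecting a missing asset could look like, the Python sketch below compares the assets a scanner has inventoried with the assets that scanned code refers to. The asset names and the `unresolved_references` helper are hypothetical, and this is not a description of how any particular product works.

```python
# Hypothetical inputs: assets the scanner has inventoried, and
# (consumer, referenced asset) pairs extracted from scanned code
SCANNED = {"dw.fact_sales", "dw.dim_customer"}
REFERENCES = [
    ("dw.fact_sales", "dw.dim_customer"),
    ("dw.fact_sales", "legacy.exchange_rates"),  # referenced, but never scanned
]

def unresolved_references(scanned, references):
    """Map each referenced-but-unscanned asset to the scanned assets that depend on it."""
    gaps = {}
    for consumer, referenced in references:
        if referenced not in scanned:
            gaps.setdefault(referenced, set()).add(consumer)
    return gaps

gaps = unresolved_references(SCANNED, REFERENCES)
print(gaps)
```

Even this basic check tells you where to look next: something called `legacy.exchange_rates` feeds `dw.fact_sales`, and nobody has scanned it yet.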

How are filters that indirectly impact data flows shown? Being able to see filters like joins or where clauses that do not directly move data but still impact how it flows is crucial when changing assets or understanding how a value was calculated. Making these easy for a user to digest will reduce the risk of oversight and further deepen their visibility.
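One way to picture this is to label each lineage edge as direct (data is copied or computed from the source) or indirect (the source only shapes which rows flow). The Python sketch below uses a hypothetical query, shown in the comment, and hand-written edges. The labels and the `sources_of` helper are assumptions for illustration only.

```python
# Hypothetical edges for this statement:
#   INSERT INTO report.active_revenue (amount)
#   SELECT o.amount
#   FROM orders o JOIN customers c ON o.customer_id = c.id
#   WHERE c.status = 'active'
EDGES = [
    ("orders.amount", "report.active_revenue.amount", "direct"),
    ("orders.customer_id", "report.active_revenue.amount", "indirect"),  # join key
    ("customers.id", "report.active_revenue.amount", "indirect"),        # join key
    ("customers.status", "report.active_revenue.amount", "indirect"),    # WHERE filter
]

def sources_of(target, edges, include_indirect):
    """Columns that feed `target`, optionally including joins and filters."""
    return {
        src
        for src, dst, kind in edges
        if dst == target and (include_indirect or kind == "direct")
    }

direct_only = sources_of("report.active_revenue.amount", EDGES, include_indirect=False)
everything = sources_of("report.active_revenue.amount", EDGES, include_indirect=True)

print(sorted(direct_only))
print(sorted(everything))
```

With direct flows only, a change to how `customers.status` is populated looks harmless. With indirect flows included, it shows up as something that can change the reported amount.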

How are changes in data pipelines shown? Whether it is documenting change control in a critical finance/risk report, identifying what change occurred to break a report, or seeing a report has a new source system, the ability to see changes in pipelines provides a whole new level of insight into the value you can deliver from your active metadata management program.
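At its simplest, showing changes means comparing two snapshots of the same lineage. The Python sketch below does this with plain set arithmetic. The snapshot contents and the `diff_lineage` helper are hypothetical, and a real platform would also track who changed what and when.

```python
# Hypothetical lineage snapshots: sets of (source, target) edges
BEFORE = {
    ("erp.gl_entries", "dw.ledger"),
    ("dw.ledger", "finance.risk_report"),
}
AFTER = {
    ("erp.gl_entries", "dw.ledger"),
    ("dw.ledger", "finance.risk_report"),
    ("vendor_feed.fx_rates", "finance.risk_report"),  # new source system
}

def diff_lineage(before, after):
    """Report which edges appeared and which disappeared between two snapshots."""
    return {"added": after - before, "removed": before - after}

changes = diff_lineage(BEFORE, AFTER)
print(changes)
```

In this example the diff surfaces the new source feeding `finance.risk_report`, which is the kind of change that change control documentation needs to record.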

How MANTA Outperforms Data Catalogs

MANTA is a true data lineage platform. Unlike data catalogs, it focuses on one problem—data visibility.

MANTA brings intelligence to metadata management by providing an automated solution that helps you drive productivity, gain trust in your data, and accelerate digital transformation. The platform includes unique features to help you get the most value out of your lineage, with more than 40 out-of-the-box, fully automated scanners.

In addition, MANTA works alongside popular data catalogs. The platform integrates with catalogs like Collibra, Informatica, Alation, and more. This allows you to harness the power of true data lineage while still benefiting from your current catalog solution.

MANTA Can Help

Are you still struggling to understand your data lineage requirements? Are you unsure if your data catalog solution is meeting the mark? MANTA can help. We’ve been innovating in the data lineage space for nearly a decade. We work with our clients to help them get the best value from their lineage solution.

Book a demo today to talk with a MANTA engineer and learn more about how we can help.
