For the past decade, we’ve been turning data processing upside down. From collecting historical data to building reports and insights about the past, we have progressed to the big data era, where we collect and process every possible data record we can get our hands on without actually knowing how we will use it. All of this is done in the hope that future generations of smarter algorithms will help us drive our businesses forward. But all those efforts are overshadowed by the ethical questions we should be asking ourselves: how do we make sure we’re using this data for good?
Data Is NOT A Problem
Walking down the path of data infrastructure evolution, you will notice that it’s getting more and more complex with every step. It is not the data that causes problems for us; it is how we handle and process it. We have built these massive data pipelines over time, combining everything one can imagine (batch, real-time, streams, microservices, cloud, NoSQL, AI/ML), all to get more valuable insights to drive our business while making sure, and being able to prove, that we do it in an ethical and responsible way. That holds both when we build new pipelines and when we review and vet the existing ones.
And yet, for all these years, we have had no tools that could help us fully understand and control this complexity. It feels like there’s always something we are missing, something we can’t put our finger on. Our blind spots. And that comes at a price:
- Delivery of new analytical / predictive insights slows down as we struggle to understand a complex environment and lose the ability to quickly change existing systems based on current business needs and in compliance with regulatory (mainly around privacy) and other (including ethics, sustainability) requirements.
- Trust in reports, dashboards, and analytical insights sinks as we are unable to explain fully how the numbers we present have been calculated, what their origin is, and what their associated data quality or data privacy attributes are.
- The number of data incidents grows as we are unable to fully assess the impact of new business requirements or technical changes that must be implemented in our data systems.
- We waste the valuable time of data engineers, who spend 40% to 50% of their time running manual impact analysis, assessments, or root-cause analysis. Not to mention how frustrating it is.
- The risk of non-compliance and regulatory penalties increases, especially around data privacy (GDPR, CCPA/CPRA, HIPAA, etc.) and data quality (Basel, SOX, AML, etc.).
All of that inevitably leads to missed opportunities (business value we fail to derive from data), wasted investments (billions spent on building uncontrollable data infrastructures), and frustration on both the business and technology sides.
Your Metadata Doesn’t Tell You Everything
The key to our evolution with data is, no surprise, metadata. But not just any metadata. We need to be selective about what metadata we use, and we need to be smart about how we use it. Metadata has been with us for ages but was never really successful. Part of the problem is that we simply decided to collect it without thinking about whether and how it might be useful (similar to what we did with data at the beginning of the big data hype).
To maintain an overview of our data, we look at basic metadata such as data profiles (type of data, business classification, quality score, etc.) or operational characteristics (who accesses data, how often, popular data sets, etc.). While these pieces of information are interesting and useful, they are static and don’t cast any light on blind spots of complex data pipelines. True power comes with a detailed understanding of data lineage. True data lineage, not what is often mistaken for it.
Metadata — Unsung Hero of Effective Data Management
Thanks to vendors offering only very rudimentary data lineage capabilities and trying to hide their deficiencies, most people see data lineage as information about the source of data and its journey from one table to another, up to reports or dashboards. But true data lineage is far more than that. Data lineage represents a detailed map of all direct and indirect dependencies among data entities in the environment. Why is that so important?
- Are you wondering how a change to a bonus calculation algorithm in the sales data mart will affect your weekly financial forecast report, and whether the report will still be consistent with the forecast definition?
- Are you trying to work out the best subset of test cases, covering most or all data flow scenarios, to run for your newly released pricing database app?
- Are you questioning if existing ETL/ELT processes have any overlaps and if consolidation is possible?
- Are you planning a massive migration to the cloud and looking for a way to divide the data system into smaller chunks and migrate them independently without breaking other pieces?
These are just a few examples of how data lineage can change the whole “complexity game” for you and provide a panoramic overview of your data landscape. And yes, data lineage does give you information about the journey of data, helps you with regulatory compliance, or allows you to chase and fix data incidents in seconds.
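To make the "map of direct and indirect dependencies" idea a bit more concrete, here is a minimal sketch of impact analysis over a lineage graph. It assumes lineage has already been harvested into simple (upstream, downstream) pairs; the entity names and the `downstream_impact` helper are hypothetical and invented purely for illustration, not taken from any particular lineage tool.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges as (upstream, downstream) pairs.
# All entity names are made up for this example.
EDGES = [
    ("sales_mart.bonus_calc", "sales_mart.bonus"),
    ("crm.customers", "sales_mart.bonus"),
    ("sales_mart.bonus", "finance.weekly_forecast"),
    ("finance.weekly_forecast", "dashboards.cfo_report"),
]


def build_graph(edges):
    """Index edges so each entity maps to its direct consumers."""
    graph = defaultdict(set)
    for upstream, downstream in edges:
        graph[upstream].add(downstream)
    return graph


def downstream_impact(graph, start):
    """Return every entity that depends on `start`, directly or indirectly."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, ()):
            if child not in seen:  # also guards against cycles
                seen.add(child)
                queue.append(child)
    return seen


graph = build_graph(EDGES)
impacted = downstream_impact(graph, "sales_mart.bonus_calc")
print(sorted(impacted))
```

With these sample edges, a change to the bonus calculation reaches the bonus table, the weekly forecast, and the CFO dashboard. Real lineage is usually tracked at column level and across many technologies, so treat this only as an illustration of the traversal idea.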
But even data lineage must be activated to become truly beneficial for an organization that wants to stay relevant in today’s fast-paced environment. What is activation? In layman’s terms, it means having an automated program take data lineage information that is just sitting somewhere and turn it into actions. Examples?
- Continuous detection of “dead tables” where potentially sensitive information is stored but not accessed or used (like when we create a helper table to process large amounts of data and forget to drop it, or when we decommission part of a system without actually knowing what all its components are and overlook some of them).
- Instant alerts on changes to the environment with negative impact on tactical management reports, or on key data features used by a data science team (like if you change a gender column in a database from male/female to male/female/other without realizing that the data from that column is used as an important filtering condition down the road in a Python algorithm).
- Proactive notifications about overly complex parts of our data pipelines where refactoring or redesign would help reduce the risk of failure.
- Proactive alerts to report users when a report is out of date because a pipeline failed somewhere upstream.
- Proactive alerts to the AI/ML components of the data pipelines about upstream changes that may affect the AI/ML input features and require the model to be retrained.
- Automated selection of components that should be deployed to production with a new release of a data system to make sure all dependencies are included and the whole package does not contain any unnecessary parts.
- Warnings if we design a data pipeline moving data between locations where no data should ever be moved (like if we accidentally build a data flow between production and test data lake, or between two countries with incompatible data protection policies).
- Warnings if we design a data pipeline moving PI or PII data to a storage or a system that is not certified to hold PI or PII data.
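Two of the activation examples above, dead-table detection and PII flow warnings, can be sketched as simple checks over lineage metadata. This is only an illustration under stated assumptions: the `Entity` fields (`holds_pii`, `pii_certified`, `is_endpoint`), the entity names, and both helper functions are hypothetical, and the PII check looks only at direct flows rather than propagating sensitivity through the whole graph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical metadata attached to a node in the lineage graph."""
    name: str
    holds_pii: bool = False
    pii_certified: bool = False  # storage approved to hold PII
    is_endpoint: bool = False    # report or dashboard read by people


# Made-up entities and lineage edges (upstream, downstream).
ENTITIES = {
    e.name: e
    for e in [
        Entity("crm.customers", holds_pii=True, pii_certified=True),
        Entity("staging.tmp_customers_helper", holds_pii=True, pii_certified=True),
        Entity("analytics.sandbox"),
        Entity("dashboards.churn_report", is_endpoint=True),
    ]
}
EDGES = [
    ("crm.customers", "staging.tmp_customers_helper"),
    ("crm.customers", "analytics.sandbox"),
    ("analytics.sandbox", "dashboards.churn_report"),
]


def find_dead_tables(entities, edges):
    """Entities that nothing reads from and that are not reports themselves."""
    has_consumer = {upstream for upstream, _ in edges}
    return sorted(
        name
        for name, entity in entities.items()
        if name not in has_consumer and not entity.is_endpoint
    )


def find_pii_violations(entities, edges):
    """Direct flows from a PII-holding source into non-certified storage."""
    return [
        (upstream, downstream)
        for upstream, downstream in edges
        if entities[upstream].holds_pii and not entities[downstream].pii_certified
    ]


print(find_dead_tables(ENTITIES, EDGES))
print(find_pii_violations(ENTITIES, EDGES))
```

In this sample, the forgotten helper table is flagged as dead, and the flow from the customer table into the uncertified sandbox is flagged as a PII violation. In practice such checks would run continuously against harvested lineage and feed alerts or deployment gates instead of printing to the console.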
What Does 2022 Have in Store for Data?
The underlying theme here is that to beat the complexity we deal with, we need machines, automation, and intelligence to help us. We have customers complaining about how complex their data lineage is even when they visualize it at only a low level of detail. And I am not surprised: with the massive data systems they have built, there is a limit to what a human brain can process. So, if you ask me what 2022 (and probably also 2023, 2024, 2025, etc.) will be about in data, I would bet on smart activation of carefully selected types of metadata to make life easier for data professionals, ensure we work with data in an ethically responsible way, and make data more valuable and useful to organizations across the globe. Cheers to that!