Let’s start with a story. Once upon a time there was a data engineer working as a consultant for a large financial organization that was consolidating financial and risk data from all fifteen of its European and African subsidiaries in one large data lake. It was a very complex system processing vast amounts of data every day, with analytical insights and predictions driving business for the whole group. And that system, like every other system, was subject to an endless series of changes driven by business, legal, compliance, technology, and so on.
One day, this data engineer was tasked with what he considered to be a routine change in their production environment. Because he was a good data engineer, he spent several hours analyzing the requirement, looking for the best way to implement it, and assessing the impact of the change. Together with the QA team, he also ran several regression tests. But the system was very complex, and despite all that effort, they overlooked something. When the change was implemented, it affected the major pipeline calculating risk metrics that were then pushed back to the respective subsidiaries.
It was a subtle error causing a lot of damage. However, data quality monitoring identified the incident within a couple of minutes. Happy ending, right? Not really. The organization spent the next five months recovering from that incident. During that time, they had limited use of critical risk and financial data, and the total business impact was around forty million euros. One person, one blind spot, one mistake that would never have happened if that data engineer had had a way to run a more complete, detailed, and automated impact analysis.
That is the end of the story. But not the end of the problem. As we all work more and more with data, our dependency on it increases rapidly. Our daily decision making is close to impossible without all the reports, intelligent insights, and analytics we have built in the past years. It is no surprise that we increasingly care about trust in data and its reliability. To protect ourselves, we deploy modern data quality and data observability solutions without realizing their limitations and their inherently reactive nature (defects are identified, not prevented!). As a result, we continue to struggle with a growing number of data incidents, which erodes trust in data and frustrates both technology professionals and business data users.
Mapping Data Dependencies to Prevent Data Incidents
Our goal should not simply be to identify all data bugs once they happen (regardless of how fast we can do that); our ultimate responsibility is to prevent such bugs from happening. To achieve that, a modern data team must first address the underlying issues – to fully understand the map of all dependencies among data in the environment (often called data lineage) and activate it.
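To make "activating" a dependency map concrete, here is a minimal sketch of an impact analysis over lineage. It assumes the lineage is already available as a simple adjacency mapping; the asset names and the `downstream_impact` helper are hypothetical and only illustrate the idea of asking "what does this change touch?" before deploying it.

```python
from collections import deque

# Hypothetical lineage map: each key feeds the assets listed in its value.
LINEAGE = {
    "crm.customers": ["staging.customers"],
    "staging.customers": ["dw.risk_metrics", "dw.customer_dim"],
    "dw.risk_metrics": ["report.subsidiary_risk"],
    "dw.customer_dim": [],
    "report.subsidiary_risk": [],
}

def downstream_impact(lineage, changed):
    """Return every asset reachable from `changed`, in breadth-first order."""
    seen = {changed}
    order = []
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order

# Which assets could a change to staging.customers affect?
impacted = downstream_impact(LINEAGE, "staging.customers")
print(impacted)
```

A real lineage map would be extracted automatically from code and metadata rather than written by hand, but the question asked of it stays the same.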
Data systems are unique in that we need to carefully watch both “the system” (data stores and data pipelines) and the data in it. If we want to monitor, control, or improve quality, our approach must address both components. For decades, our main focus has been on data, and we basically ignored data pipelines (the infrastructure that moves and transforms data, performs calculations and analytics, etc.). That is highly problematic because a significant number of incidents are not caused by wrong data entry. The typical root cause is that something is off with our data pipelines: a code change impacting an important API, a new algorithm for bonus calculations, data stored to a table with a different name, or a modified database column that breaks a management report. Such changes typically break our environment once deployed, silently and without warning. We need tools that shield us from them and uncover them before they cause damage.
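The column-change case can be sketched as a pre-deployment check. This is an illustration under assumptions, not a description of any particular tool: the `CONSUMERS` registry, the table and column names, and `broken_consumers` are all hypothetical, and in practice the registry would be derived from lineage rather than maintained manually.

```python
# Hypothetical registry of which columns each downstream consumer reads.
CONSUMERS = {
    "report.management_summary": {"dw.bonuses": {"employee_id", "bonus_amount"}},
    "api.payroll_export": {"dw.bonuses": {"employee_id", "bonus_amount", "currency"}},
}

def broken_consumers(consumers, table, new_columns):
    """List (consumer, missing columns) pairs for a proposed schema of `table`."""
    problems = []
    for consumer, reads in sorted(consumers.items()):
        missing = reads.get(table, set()) - set(new_columns)
        if missing:
            problems.append((consumer, sorted(missing)))
    return problems

# Proposed change: rename bonus_amount to bonus_total.
proposed = ["employee_id", "bonus_total", "currency"]
problems = broken_consumers(CONSUMERS, "dw.bonuses", proposed)
for consumer, missing in problems:
    print(f"{consumer} would lose: {', '.join(missing)}")
```

Run before deployment, a check like this turns a silent production break into a visible review comment.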
We like to believe that testing will discover all these things. But there are two issues with this. First, testing is very expensive. Commonly cited software engineering estimates put the cost of fixing a bug in the testing phase at ten times or more the cost of fixing it in the design phase. Second, testing is usually successful in finding bugs within a system. The real challenge is how changes in one system affect other systems. Integration testing is one of the biggest pains for every organization. It is very slow, very expensive, and very frustrating. I have been there myself, and let’s face it: no matter how hard we try, how many cool APIs and interfaces we build, or how many discussions we have about loose coupling, immutability, and abstractions, we simply have too many blind spots when it comes to integrations and the various dependencies among systems.
This is why we see organizations turning more and more to data monitoring, data quality, and data observability tools, which makes total sense. With larger volumes of data and more complex processing (real-time streams, lakehouse architectures, advanced analytics, AI/ML), data incidents may occur at any moment with serious impact on a company’s business. We’ve seen an increase in platforms that combine good old rule-based patterns with modern AI to monitor data and detect when something goes wrong (data anomalies, unexpected data profiles, etc.). The value proposition here is very simple: no matter when or how an incident happens, your data observability tool will catch it and notify you promptly. Even at 3:00 a.m. 🙂 Ideally, it also helps you find the root cause and fix the incident as soon as possible.
Rethinking the Approach to Data Monitoring and Observability
Current approaches to data monitoring, data quality, and observability are similar to black box testing as we know it from software engineering books (to oversimplify a little). The focus is on the running system (mostly in production), and we observe its outputs (logs, data) using a variety of techniques (from pre-defined rules to adaptive AI algorithms) to detect that something is not as expected. If implemented correctly, such a platform delivers value (directly or indirectly) to everyone in the organization. Because let’s face it, our dependency on data is growing rapidly, and a few undetected incidents can cause major unwanted distractions or massively impact our tactical or even strategic decision making.
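As a rough illustration of the black-box style, the sketch below flags an unusual daily row count using a simple z-score rule. The threshold, the sample numbers, and the `is_anomalous` helper are assumptions chosen for the example; real observability platforms use far richer profiles and models.

```python
from statistics import mean, stdev

def is_anomalous(history, latest, threshold=3.0):
    """Flag `latest` if it lies more than `threshold` standard deviations
    from the mean of `history`."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # No variation in history: any different value counts as an anomaly.
        return latest != mu
    return abs(latest - mu) / sigma > threshold

# Hypothetical row counts observed for one table over the last five days.
daily_row_counts = [10_020, 9_980, 10_050, 9_940, 10_010]

print(is_anomalous(daily_row_counts, 10_030))  # False: within normal variation
print(is_anomalous(daily_row_counts, 4_000))   # True: far outside it
```

Note that the check can only fire after the suspicious data has been produced, which is exactly the reactive property discussed here.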
And while this approach is a must, it cannot be the ONLY thing we do. Here are two reasons why:
- A majority of incidents are caused by recently implemented changes and can be discovered before the change is even deployed, using inspection, code analysis, and verification techniques. This concept has been well known to software developers for ages, and yet it is still missing in the realm of data systems, mostly because implementing it successfully requires a detailed understanding of the dependencies among data elements in the environment.
- With the traditional approach we only have run-time information, typically about the most recent and/or most frequent data flow scenarios (every data pipeline has many different paths that data may take, depending on conditions, input data records, etc.). Knowing the internal structure of data pipelines allows us to validate all scenarios. This is especially important when it comes to vulnerabilities, privacy, and security.
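The second point can be sketched as follows: given the internal structure of a pipeline, every possible path can be enumerated statically and compared with the paths actually observed at run time. The pipeline steps, the `all_paths` helper, and the observed path are hypothetical and serve only to illustrate the gap.

```python
# Hypothetical pipeline as a DAG of steps; branches model conditional routing.
PIPELINE = {
    "ingest": ["validate"],
    "validate": ["enrich", "quarantine"],
    "enrich": ["mask_pii", "load"],
    "mask_pii": ["load"],
    "quarantine": [],
    "load": [],
}

def all_paths(dag, start):
    """Enumerate every path from `start` to a terminal step (depth-first)."""
    successors = dag.get(start, [])
    if not successors:
        return [[start]]
    paths = []
    for nxt in successors:
        for tail in all_paths(dag, nxt):
            paths.append([start] + tail)
    return paths

static_paths = all_paths(PIPELINE, "ingest")

# Assume run-time monitoring has only ever seen this one path.
observed = [["ingest", "validate", "enrich", "mask_pii", "load"]]
unobserved = [p for p in static_paths if p not in observed]

for path in unobserved:
    print(" -> ".join(path))
```

In this made-up example, one of the never-observed paths reaches `load` without passing through `mask_pii`, the kind of privacy issue that run-time information alone would not reveal until it actually happens.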
From Monitoring to Observability to Incident Prevention
If a company wants to improve the quality and reliability of its data, it must expand its current strategy from observing data alone to also observing data pipelines, and shift its focus from monitoring and ex-post detection of incidents to prevention.
Organizations should implement a data lineage solution with the following core capabilities:
- Detailed data flows
- High level of accuracy and automation
- The ability to capture calculations and transformations together with their semantics
- Tracking historical revisions of lineage
- The ability to ingest additional metadata, such as data quality scores
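One way to picture these capabilities together is as a single lineage record. The sketch below is an assumed, simplified data model, not the schema of any specific product: `LineageEdge`, its fields, and the sample values are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class LineageEdge:
    source: str            # detailed flow: column-level source
    target: str            # column-level target
    transformation: str    # the calculation applied, with its semantics
    revision: int          # historical revision of the lineage
    metadata: dict = field(default_factory=dict)  # e.g. data quality scores

# Hypothetical lineage for one risk column across two revisions.
EDGES = [
    LineageEdge("staging.trades.notional", "dw.risk.exposure",
                "SUM(notional) GROUP BY subsidiary", 1, {"dq_score": 0.98}),
    LineageEdge("staging.trades.notional", "dw.risk.exposure",
                "SUM(notional * fx_rate) GROUP BY subsidiary", 2, {"dq_score": 0.98}),
    LineageEdge("staging.fx.fx_rate", "dw.risk.exposure",
                "SUM(notional * fx_rate) GROUP BY subsidiary", 2, {"dq_score": 0.91}),
]

def added_sources(edges, target, old_rev, new_rev):
    """Sources feeding `target` in `new_rev` that were absent in `old_rev`."""
    old = {e.source for e in edges if e.target == target and e.revision == old_rev}
    new = {e.source for e in edges if e.target == target and e.revision == new_rev}
    return sorted(new - old)

new_inputs = added_sources(EDGES, "dw.risk.exposure", 1, 2)
print(new_inputs)
```

Keeping revisions side by side makes it possible to ask what changed between two versions of the pipeline, and attached metadata shows how trustworthy each new input is.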
With AI/ML augmenting current data engineering processes, effective lineage analysis should also be used to predict problems in the pipeline that may further impact the whole environment and negatively affect data users. This includes, but is not limited to, inconsistent interfaces, incorrect mappings between data structures and ETL/ELT workflows, changes to data structures that indirectly influence other parts of the system, frequent changes to any part of the pipeline, and overly complex parts of the pipeline.
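Two of these signals, frequent changes and excessive complexity, can be approximated even without AI/ML. The sketch below is a plain rule-based stand-in, not a predictive model; the step names, the thresholds, and the `risk_flags` helper are assumptions made for illustration.

```python
# Hypothetical per-step facts that lineage and version history could supply.
STEPS = {
    "load_trades": {"inputs": ["src.trades"], "changes_last_30d": 1},
    "calc_exposure": {"inputs": ["a", "b", "c", "d", "e"], "changes_last_30d": 2},
    "calc_bonus": {"inputs": ["hr.salaries", "hr.targets"], "changes_last_30d": 9},
}

def risk_flags(steps, max_changes=5, max_inputs=4):
    """Return the steps that exceed the (arbitrary) change or fan-in limits."""
    flags = {}
    for name, info in steps.items():
        reasons = []
        if info["changes_last_30d"] > max_changes:
            reasons.append("frequent changes")
        if len(info["inputs"]) > max_inputs:
            reasons.append("high fan-in")
        if reasons:
            flags[name] = reasons
    return flags

flags = risk_flags(STEPS)
for step, reasons in flags.items():
    print(f"{step}: {', '.join(reasons)}")
```

A learned model could replace the fixed thresholds, but the inputs would still come from the same place: detailed, versioned lineage.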
That is the foundation for a future incident-free data infrastructure over which we have total visibility and control. It significantly expands the concept of quality and observability by applying well-known and successful software engineering concepts of prevention, static analysis, and formal verification on a larger, integrated scale.