MANTA’s very own Ernie Ostic writes his first-ever post on MANTA’s blog to share his insights on the unmapped sections of your data pipeline and how you can resolve such issues by identifying your black boxes and shining a light on them.
What is a black box?
We have all used or heard the term black box. But what does it mean, and how is it applied in the world of data? The history of the phrase is quite interesting: according to Wikipedia and other online sources, it was originally coined during early research into electronic circuitry. It refers to something opaque that we can’t see through or into, and thus it hides all the details, everything but the inputs and outputs.
These two words have even been used to define aspects of psychology and the human brain. In many ways, we are all walking around with black boxes in our heads, carrying hidden details that are difficult to decipher from the outside. But let’s turn to our data domain and consider how black boxes are keeping us in the dark, both literally and figuratively.
Your black box is anything that is hiding bits and pieces, or perhaps even major parts, of your data pipeline. It may be the whole pipeline or just a minor piece. What is it preventing you from seeing? It’s not really a physical box, of course, but the concept allows us to define and describe the lack of information. This information includes details that are inaccessible because they are unknown, undocumented, or filed away in someone’s mind. In everyone’s data pipelines, the black box covers specific patterns of data flow: the places where columns of information are pulled from or received, the programs and processes that take the data and manipulate it, and the target systems where it is written.
This is especially true for analytics, which is most at risk of black-box obscurity for several reasons. First, analytics sits farthest downstream: in most organizations, analytic data has the longest journey to its reporting or analytic destination, with the largest number of actions occurring against it. Another blogger compared lineage to the old-fashioned telephone game. The longer the chain of people playing, the more distorted the original message becomes by the time it reaches the end. Analytics also amplifies the concerns and problems. This is where data is aggregated, teased apart, and changed so that it is easier to understand. Finally, as you move downstream in your data pipeline, the number of independent branching paths that data can travel expands, making the different routes harder to see. There are simply more roadways!
Technically speaking, these roadways are built out of code and processing logic of all kinds. They are built of code embedded in databases, purchased applications, your big data and Hadoop environments, your ETL and ELT tooling, and even your independent, employee-owned spreadsheets! They include code that performs translations, string manipulation, date conversions, Boolean tests, rejects, audits, and more—changing your data along the way and delivering it to an unknown number of places. Think of these roadways as hidden corridors. Now imagine those same corridors cloaked in darkness with no way to see their details or what lies ahead or the path to the beginning.
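To make the roadway idea concrete, here is a minimal Python sketch of column-level lineage, assuming each hop is recorded as a source, a transformation, and a target. Every system, column, and function name in it is hypothetical and invented for illustration; it is not how any particular lineage product stores or walks its metadata. It traces a report column back along "the path to the beginning."

```python
from collections import defaultdict

# Hypothetical column-level lineage edges: (source, transformation, target).
# All system and column names are invented for illustration.
EDGES = [
    ("crm.customers.birth_date", "date conversion", "staging.customers.birth_dt"),
    ("crm.customers.full_name", "string manipulation", "staging.customers.last_name"),
    ("staging.customers.birth_dt", "Boolean test (age >= 18)", "warehouse.dim_customer.is_adult"),
    ("staging.customers.last_name", "copy", "warehouse.dim_customer.last_name"),
    ("warehouse.dim_customer.is_adult", "aggregation", "report.adult_customers.total"),
]


def build_upstream_index(edges):
    """Map each target column to the (source, transformation) pairs feeding it."""
    index = defaultdict(list)
    for source, transformation, target in edges:
        index[target].append((source, transformation))
    return index


def trace_to_origin(column, index):
    """Walk upstream from a column; return each hop as (source, transformation, target)."""
    hops = []
    seen = set()
    stack = [column]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles in the recorded lineage
        seen.add(current)
        for source, transformation in index.get(current, []):
            hops.append((source, transformation, current))
            stack.append(source)
    return hops


index = build_upstream_index(EDGES)
for source, transformation, target in trace_to_origin("report.adult_customers.total", index):
    print(f"{target} <- [{transformation}] <- {source}")
```

Even in this toy example, a single report total depends on an aggregation, a Boolean test, and a date conversion spread across three layers. Without the recorded edges, each of those steps is part of the black box.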
What can you do to see inside your black box?
You need a solution that will give you clarity. You need a solution that will let light into the black box, remove the layers of obscurity, and provide visibility into all your systems. Lineage solutions help you deliver that clarity. Get a lineage platform that can do at least the following, preferably in an automated fashion.
Simplify consumption. Lineage is useless if you can’t understand it. Too often, lineage diagrams look like spaghetti and bewilder the researcher instead of providing insights. Lineage should be filterable, color-coded, and support zoom-in and zoom-out. It should let you drill for more detail or step back for less detail, and it should be easy on the eyes, making it clear where you are when exploring the black box. Imagine being in the dark corridors of your data flows without a flashlight!
Highlight calculations. Your lineage flashlight needs to be able to shine on the details of the path, to be able to look into every niche for low-level calculations. Some consumers don’t need or even want such details, but when they are desired, they should be immediately accessible without requiring a hundred mouse clicks.
Deliver faster and ensure reliability. Lineage reports yield insights in many directions. Very often, our application teams need to gain insight into black boxes when looking downstream. Code maintenance and especially application migration efforts demand the ability to determine the impact of our changes or to properly scope a new project, such as a ground-to-cloud migration. Our teams need to be able to see exactly what is being used throughout the pipeline, both to determine whether it is worthy of modernization or migration and to help prevent disasters caused by unknowingly breaking a critical report or other important downstream processes.
Enable lineage history. How often has an analyst with their hair on fire come into the office in a panic screaming “Where did we get these numbers for the fourth quarter of last year!?” Reconciling values for upper management is a big part of lineage forensics. Make sure your lineage solution can dive into the past and pull up lineage as it was defined for that particular point in time (last week, last month, last quarter, last year, etc.). Also, ensure that you can make comparisons between today’s lineage and yesterday’s lineage to decipher potential problems or sort out historical code changes.
Increase trust in data and the results of its analysis. Consider your data citizens and analysts, each of whom deserves to trust the data that they are crunching for analysis. Even more importantly, consider the executives who are using analytics results and predictive models to make crucial decisions. Their trust in the data and its results is directly connected to their confidence in using those results to guide the enterprise. Without lineage visibility, trust is difficult or impossible to obtain.
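Two of the capabilities above, downstream impact analysis and lineage history comparison, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: lineage is assumed to be captured as dated snapshots of source-to-target edges, and the snapshot dates, asset names, and function names are all hypothetical rather than taken from any real lineage platform.

```python
from collections import defaultdict, deque

# Hypothetical lineage snapshots keyed by capture date. Each snapshot is a set
# of (source, target) edges; every name below is invented for illustration.
SNAPSHOTS = {
    "2023-12-31": {
        ("erp.orders.amount", "warehouse.fact_sales.revenue"),
        ("warehouse.fact_sales.revenue", "report.q4_summary.revenue"),
    },
    "2024-03-31": {
        ("erp.orders.amount", "warehouse.fact_sales.revenue"),
        ("erp.orders.discount", "warehouse.fact_sales.revenue"),
        ("warehouse.fact_sales.revenue", "report.q4_summary.revenue"),
        ("warehouse.fact_sales.revenue", "dashboard.exec.revenue_trend"),
    },
}


def downstream_impact(edges, start):
    """Return every asset reachable downstream of `start` (breadth-first)."""
    children = defaultdict(set)
    for source, target in edges:
        children[source].add(target)
    impacted, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted


def compare_snapshots(old_edges, new_edges):
    """Return (added, removed) edges between two lineage snapshots."""
    return new_edges - old_edges, old_edges - new_edges


current = SNAPSHOTS["2024-03-31"]
print("impacted:", sorted(downstream_impact(current, "erp.orders.amount")))

added, removed = compare_snapshots(SNAPSHOTS["2023-12-31"], current)
print("added:", sorted(added))
print("removed:", sorted(removed))
```

In this invented example, the comparison shows that a discount column began feeding revenue after the year-end snapshot. That is the kind of change an analyst would want to know about when reconciling last year's fourth-quarter numbers against today's.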
Everyone along the data pipeline needs to be able to see what is happening to your data! Implementing a proper lineage solution improves operational productivity and development, and it supports smarter, faster, more effective decision-making. It also makes compliance easier, because regulators are satisfied when clarity exists for your transformation and data flow logic. Identify your black boxes and shine a light on them with a lineage solution that meets the needs of all your enterprise users, from IT developers to a full range of managers and decision-makers!
This article was also published on Ernie Ostic’s LinkedIn Pulse.