“Big Data is not a system; it is simply a way to say that you have a lot of data.” Tomas Kratky explores why we shouldn’t ditch the data warehouse because there is, in fact, no successor yet.
Just a few days ago LinkedIn notified me about a very interesting research study, “Is the Data Warehouse Dead?” (done by The Information Difference and sponsored by IBM in January 2015). Although this study is more than a year old, it is, I believe, still valid. The study focused on the data warehouse itself and its future in the new world of Big Data technologies. I am happy that someone has started to ask more challenging questions in a world where the buzzword Big Data is treated as a cure for everything. I definitely suggest that everyone read the study, but I still have a few serious issues with it.
Not a System, Just Data
The whole idea of asking if the data warehouse (DWH) is going to be replaced by Big Data is stupid (please, pardon my language). The DWH is a critical system which is responsible for gathering and consolidating data and providing the information needed to run a business. Big Data, by contrast, is, as defined in the study, “the term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.” Do you see the problem? Big Data is not a system; it is simply a way to say that you have a lot of data. It does not make sense to ask if Big Data is going to replace DWHs. The right question to ask is whether, and how, Big Data technologies are going to penetrate our existing DWH world.
Everybody’s Happy, Right?
Two questions in the study were aimed at the costs of DWH maintenance and user satisfaction. The result was that only 27% of companies are unhappy about the costs. WTF? And fewer than 10% of respondents find their DWH unsatisfactory (from the usage perspective). Really? Every day I hear companies complaining about DWH maintenance costs being so high, changes being so slow, DWH teams being so rigid and formal, data quality being so bad, and I could go on and on. So what is the issue?
Just take a look at the types of respondents surveyed. No surprise that around 80% of them are IT guys. IT is usually responsible for DWH design, development, and maintenance. If someone asks you about a house you built with your own hands and which is your responsibility, will you tell him that the house is bad, expensive, and hard to live in? I would say not; your response will typically be “yeah man, it’s a great house” or “it’s not bad” at worst. Dear analysts, go ask CFOs and business users and your survey will look completely different.
But despite the things I don’t like about the study, the parts with actual facts are really useful. For example, most companies have more than two DWHs (70%), hold less than 50 terabytes of data (53%), and need at least 7 people, and in some cases hundreds, just to maintain the DWH. It is also obvious that most companies are still playing with Big Data technologies without any need or proper plan.
I’m suspicious about all those consultants, analysts, and “industry experts” talking about the importance of Big Data. All of them are showing us examples of sensors. But you know what? Most companies do not have any sensors or, to be honest, do not need any sensors. And data coming from web logs and mobile apps? That’s nothing special.
Another argument for Big Data is that it allows you to store everything that you will need in the future. Okay, do you really believe that a company can store all its data on Hadoop? Of course not! You can’t store everything. It is not even possible to say what “everything” is. You must choose what to store. And we are still not able to foresee the future, so we simply select those data sets which we believe could be most useful later on.
Big Data Is Not for Everything (Or for Everyone)
And this is exactly why almost all companies are still struggling with Big Data, trying to figure out how to use it effectively without destroying their existing infrastructure. It is also one reason why the study revealed no success stories and no really big data out there managed by Big Data technologies. And you know what the saddest thing is about the whole study? Question number 8 asked respondents if they view Big Data as being important for their organization, and 64% consider Big Data to be important or even very important. That’s actually not surprising. Most of the respondents were IT guys, and all of us engineers love to play with new toys. But why is Big Data important for their business? And how exactly do they plan to use those technologies?
I see the DWH as a complex system where multiple technologies are needed. And Big Data technologies are a nice evolutionary step. Relational databases, NoSQL databases, Hadoop, Spark, and other components have their value when used properly. My experience is that Big Data technologies are very efficient for storing and processing unstructured data. We can also use Big Data technologies very effectively to build staging areas, to replace old ways of doing ETLs, and to reduce licensing costs significantly (one of our clients recently shared with me that more than 50% of their Teradata processing capacity is used for ETLs).
This is exactly where all development is heading right now. We have already met a significant number of customers redesigning their DWH in this manner. And most of the respondents in the study feel the same way. But we must be very careful with the governance of newly modernized DWHs. Everything is so young and immature. When rebuilding your staging area you can’t lose your ability to gather and use metadata, to work with data lineage, and to do impact analyses; you can’t exclude it from data quality processes; you can’t put everything you’ve built over the last few years in danger. If you screw it up, your maintenance costs will simply blow up.
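To make the impact-analysis point concrete, here is a minimal Python sketch, assuming lineage metadata is available as simple source-to-target edges. The table names and the `downstream_of` helper are hypothetical and only illustrate the idea; they do not come from the study or from any particular tool.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges (source -> target); all names are illustrative.
LINEAGE = [
    ("crm.customers", "staging.customers"),
    ("erp.orders", "staging.orders"),
    ("staging.customers", "dwh.dim_customer"),
    ("staging.orders", "dwh.fact_sales"),
    ("dwh.dim_customer", "dwh.fact_sales"),
    ("dwh.fact_sales", "report.monthly_revenue"),
]


def downstream_of(node, edges):
    """Return every object that directly or indirectly depends on `node`."""
    # Index the edges so each object's direct dependents are easy to look up.
    children = defaultdict(list)
    for source, target in edges:
        children[source].append(target)

    # Breadth-first walk over the lineage graph, collecting dependents.
    seen = set()
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for child in children[current]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


if __name__ == "__main__":
    # What is affected if the staging table for customers changes?
    print(downstream_of("staging.customers", LINEAGE))
```

If the staging layer moves to a new platform and edges like these are no longer captured, this kind of question can no longer be answered, which is exactly the governance risk described above.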
People have their limitations. Solving complex problems is a hard job for most of us. There’s nothing bad about it; our brain has its boundaries and DWH design is a really complex task. Things change every day, every minute, and those changes must be incorporated to continue supporting business needs. It is no surprise that in most organizations, the DWH has serious problems after several years. The incoming Big Data hype might push those who are less experienced to rebuild the whole DWH with those shiny new Big Data technologies. Unsurprisingly, that will only lead to a bigger mess with two overlapping systems, no data governance, and maintenance costs that are through the roof.
One of the respondents described it nicely: “I suspect that ‘Big Data’ is a way to THINK that you are obtaining good data while avoiding the hard work of understanding and designing data models.” Turing Award winner Fred Brooks once said: “There is no silver bullet.” And, in fact, there is none.