Tomáš Krátký

A Metadata Map Story: How We Got Lost When Looking for a Meeting Room

September 1, 2017

You may think that I have gone crazy after reading the title above or hope that our blog is finally becoming a much funnier place. But no, I am not crazy and this is not a funny story. [LONG READ]

It is, surprisingly, a metadata story. A few months ago, when visiting one of our most important and hottest prospects, we arrived at the building (a super large finance company with a huge office), signed in and passed through security, called our main contact there, shook hands with him, and entered their private office space with thousands of work desks and chairs, plus many restrooms, kitchens, paintings, and also meeting rooms.

The Ghost of Blueberry Past

A very important meeting was ahead of us, with the main business sponsor who had significant influence over the MANTA purchasing process. Our main agenda was to discuss business cases involving metadata and the role of Manta Flow. So we followed our guide and I asked where we were going. “The blueberry meeting room”, he replied. We stopped several times, checking our current position on the map and trying to figure out where to go next. (It is a really super large office space.) After 10 minutes, we finally got very close, at least according to the map. Our meeting room should have been, as we read it on the map, straight and to the left. But it was not! We ran all over the place, looking around every corner, checking the name printed on every meeting room door, but nothing. We were lost.

Fortunately, there was a big group of people working in the area, so we asked those closest to us. Several guys stood up and started to chat with us about where that room could be. Some of them started to search for the room for us. And luckily, there was one smart and knowledgeable woman who actually knew the blueberry meeting room very well and directed us to it. In 20 seconds, we were there with the business sponsor, although we were a few minutes late. Uffff.

That’s a Suggestive Question, Sir!

Our gal runs a big group of business and BI analysts who work with data every single day – they do impact and what-if analyses for the initial phase of every data-related project in the organization. They also do plenty of ad-hoc analyses whenever something goes wrong. You know, to answer those tricky management questions like:

“How did it happen that we didn’t approve this great guy for a loan five months ago?”

or

“Tell me, is there any way a user with limited access could see reports or run ad-hoc queries on sensitive, protected data that should be invisible to her?”

And I knew that, like many organizations out there, they had very bad documentation of the environment – non-existent or obsolete (which is even worse) – most of it in Excel sheets manually created for compliance reasons and uploaded to a SharePoint portal. And luckily for us, they had recently started a data governance project with only one goal – to implement Informatica Metadata Manager and build a business glossary and an information catalog with a data lineage solution in it. It seemed to be a perfect time for us with our unique ability to populate IMM with detailed metadata extracted from various types of programming code (Oracle, Teradata, and Microsoft SQL in this particular environment).

Just Be Honest with Yourself: Your Map Is Bad

So I started my pitch about the importance of metadata for every organization, how critical it is to cover the environment end-to-end, and also the serious limitations IMM has regarding programming code, which is widely used there to move and transform data and to implement business logic. But things went wrong. Our business sponsor was very reluctant to believe the story, being pretty OK with what they had as a metadata portal. (Tell me, how can anyone call SharePoint with several manually created and rarely updated Excel sheets a metadata portal? I don’t understand!) She asked us repeatedly to show her precisely how we can increase their efficiency. And she was not satisfied with my answers based on our results with other clients. I was lost for the second time that day.

And as I desperately tried to convince her, I told her the story about how we got lost and mixed it with our favorite “metadata as a map, programming code as a tricky road” comparison. “It is great that you even have a map”, I told her. “This map helped us to quickly get very close to the room and saved us a lot of time. But even when we were only 40 meters from our target, we spent another 10 minutes, the very same amount of time needed to walk all the way from the front desk to that place, looking for our room. Only because your great map was not good enough for the last complex and chaotic 5% of our trip. And what is even worse, others had to help us, so we wasted not only our time, but also theirs. So this missing piece of the map multiplied our effort and decreased our efficiency. And now think about what happens when 40% to 50% of your metadata map is missing, which is the portion of logic you have hidden here inside various kinds of programming code invisible to IMM. Do you really want to ignore it? Or do you really want to track it and maintain it manually?”

And that was it! We got her. The rest of our meeting was much nicer and smoother. Later, when we left, I realized once again how important a good story is in our business. And understandability, urgency and relevance for the customer are what make any story a great one.

And what happened next? We haven’t won anything yet, it is still an open lead, but now nobody has doubts about MANTA. They are struggling with IMM a little bit. So we are waiting and trying to assist them as much as possible, even with technologies that are not ours. Because in the end it does not matter if we load our metadata into IMM or any other solution out there. As long as there is any programming code there, we are needed.

This article was originally published on Tomas Kratky’s LinkedIn Pulse.

Return of the Metadata Bubble

July 27, 2017

The bubble around metadata in BI is back – with all its previous sins and even more just around the corner. [LONG READ]

In my view, 2016 and 2017 are definitely the years for metadata management and data lineage specifically. After the first bubble 15 years ago, people were disappointed with metadata. A lot of money was spent on solutions and projects, but expectations were never met (usually because they were not established realistically, as with any other buzzword at its start). Metadata fell into damnation for many years.

But if you look around today, visit a few BI events, and read some blog posts and comments on social networks, you will see metadata everywhere. How is it possible? Simply because metadata has been reborn through the bubble of data governance associated with the big data and analytics hype. Could you imagine any bigger enterprise today without a data governance program running (or at least in its planning phase)? No! Everyone is talking about a business glossary to track their Critical Data Elements, end-to-end data lineage is once again the holy grail (but this time including the Big Data environment), and we get several metadata-related RFPs every few weeks.

Don’t get me wrong, I’m happy about it. I see proper metadata management practice as a critical denominator for the success of any initiative around data. With huge investments flowing into big data today, it is even more important to have proper governance in place. Without it, the only outcome of big (and small) data analytics would be chaos, lost money, and no additional revenue. My point is that even if everything looks promising on the surface, I feel a lot of enterprises have taken the wrong approach. Why?

A) No Numbers Approach

I have heard so often that you can’t demonstrate with numbers how metadata helps an organisation. I couldn’t disagree more. Always start to measure efficiency before you start a data governance/metadata project. How many days does it take, on average, to do an impact analysis? How long does it take, on average, to do an ad-hoc analysis? How long does it take to get a new person on board – data analyst, data scientist, developer, architect, etc.? How much time do your senior people spend analysing incidents and errors from testing or production and correcting them? My advice is to focus on one or two important teams and gather data for at least several weeks, or better yet, months. If you aren’t doing it already, you should start immediately.
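To make the measuring concrete, here is a minimal sketch (with made-up table and column names) of the kind of log a team could keep and the baseline it yields; any ticketing or timesheet data you already collect works just as well.

-- Hypothetical log of analysis tasks, filled in by the team as they go.
CREATE TABLE analysis_task_log (
    task_id     INTEGER     NOT NULL,
    task_type   VARCHAR(30) NOT NULL,  -- e.g. 'impact_analysis', 'ad_hoc_analysis', 'onboarding'
    started_on  DATE        NOT NULL,
    finished_on DATE
);

-- The baseline: average elapsed days per task type, to be compared with the
-- same query run again once the metadata project is in place.
-- (Date arithmetic varies by database; e.g. use DATEDIFF(day, ...) on SQL Server.)
SELECT task_type,
       AVG(finished_on - started_on) AS avg_elapsed_days,
       COUNT(*)                      AS tasks_measured
FROM analysis_task_log
WHERE finished_on IS NOT NULL
GROUP BY task_type;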

You should also collect as many “crisis” stories as you can. Such as when a junior employee at a bank mistyped an amount in a source system and a bad $1 000 000 transaction went through. They spent another three weeks in a group of 3 tracking it from its source to all its targets and making corrections. Or when a finance company refused to give a customer a big loan and he came to complain five months later. What a surprise when they ran simulations and found out that they were ready to approve his application. They spent another 5 weeks in a group of 2 trying to figure out what exactly happened to finally discover that a risk algorithm in use had been changed several times over the last few months. When you factor in bad publicity related to this incident, your story is more than solid.

Why all this? Because your numbers will let you build a business case, and comparing them with the numbers after the project will demonstrate the efficiency improvements – while those well-known, terrifying stories that cause so much trouble in your organisation will be your “never want it to happen again” memento.

B) Big Bang Approach

I saw several companies last year that started too broadly and expected too much in a very short time. When it comes to metadata and data governance, your vision must be complex and broad, but your execution should be “sliced” – the best approach is simply to move step-by-step. Data governance usually needs some time to demonstrate its value in reduced chaos and better understanding between people in a company. It is tempting to spend the budget quickly, to implement as much functionality as possible and hope for great success. In most cases, however, it becomes a huge failure. Many good resources are available online on this topic, so I recommend investing your time to read and learn from others’ mistakes first.

I believe that starting with the several most frequently used critical data elements is the best strategy. Define their business meaning first, then map your business terms to the real world, and use an automated approach to track your data elements at both a business and a technical level. When the first small set of your data elements is mapped, do your best to show their value to others (see the previous section about how to measure efficiency improvements). With that success, your experience with other data sets will be much smoother and easier.
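As an illustration only (all names below are invented), the “map your business terms to the real world” step can start as small as a glossary table plus a mapping of each critical data element to the physical columns that hold it:

-- A deliberately small first business glossary.
CREATE TABLE business_term (
    term_id    INTEGER       NOT NULL,
    term_name  VARCHAR(100)  NOT NULL,  -- e.g. 'Customer Lifetime Value'
    definition VARCHAR(1000) NOT NULL,
    data_owner VARCHAR(100)
);

-- Where each critical data element physically lives.
CREATE TABLE term_to_column_map (
    term_id       INTEGER      NOT NULL,
    database_name VARCHAR(100) NOT NULL,
    table_name    VARCHAR(100) NOT NULL,
    column_name   VARCHAR(100) NOT NULL
);

-- "Show me every physical location of this critical data element."
SELECT t.term_name, m.database_name, m.table_name, m.column_name
FROM business_term t
JOIN term_to_column_map m ON m.term_id = t.term_id
WHERE t.term_name = 'Customer Lifetime Value';

The automated part is then keeping this mapping honest against the code that actually moves the data, which is where harvesting comes in.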

C) Monolithic Approach

You collect all your metadata and data governance related requirements from both business and technical teams, include your management and other key stakeholders, prepare a wonderful RFP, and share it with all the vendors from the top right of the Gartner Data Governance quadrant (or the Forrester Wave if you like it more). You meet well-dressed salespeople and pre-sales consultants, see amazing demonstrations and marketing papers, hear a lot of promises about how all your requirements will be met, pick a solution you like, implement it, and earn your credit. Prrrrr! Wake up! Marketing papers lie most of the time (see my other post on this subject).

Your environment is probably very complex, with hundreds of different and sometimes very old technologies. Metadata and data governance are primarily an integration initiative. To succeed, business and IT have to be brought together – people, systems, processes, technologies. You can see how hard it is, and you may already know it! To be blunt, there is no single product or vendor covering all your needs. Great tools are out there for business users with compliance perspectives, such as Collibra or Data3Sixty; more big-data-friendly information catalogs such as Alation, Cloudera Navigator, or Waterline Data; and technical metadata managers such as IBM Governance Catalog, Informatica Metadata Manager, Adaptive, or ASG. Each one of them, of course, overlaps with the others. Smaller vendors then also focus on specific areas not covered well by other players – such as MANTA, with the unique ability to turn your programming code into both technical and business data lineage and integrate it with other solutions.

Metadata is not an easy beast to tame. Don’t make it worse by falling into the “one-size-fits-all” trap.

D) Manual Approach

I meet a lot of large companies ignoring automation when it comes to metadata and data governance. Especially with big data. Almost everyone builds a metadata portal today, but in most cases it is only a very nice information catalog (the same sort you can buy from Collibra, Data3Sixty, or IBM) without proper support for automated metadata harvesting. The “how to get metadata in” problem is solved in a different way. How? Simply by setting up a manual procedure – whoever wants to load a piece of logic into the DWH or data lake has to provide the associated metadata describing its meaning, structures, logic, data lineage, etc. Do you see how tricky this is? On the surface, you will have a lot of metadata collected, but every bit of information is not reality – it is a perception of reality and only as good as the information input by a person. What is worse is that it will cost you a lot of money to keep it synchronised with the real logic through all updates, upgrades, etc. The history of engineering tells us one fact clearly – any documentation created and maintained manually, especially documentation that is not an integral part of your code/logic, is out of date the very moment it is created.

Sometimes there is a different reason for harvesting metadata manually – typically when you choose a promising DG solution, but it turns out that a lot is missing. Such as when your solution of choice cannot extract metadata from programming code and you end up with an expensive tool without the important pieces of your business and transformation logic inside. Your only chance is to analyse everything remaining by hand, and that means a lot of expense and a slow and error-prone process.

Most of the time I see a combination of A), C), and D), and in rare cases also B). Why is that? I do not know. I have plenty of opinions, but none of them have been substantiated. One thing is for sure: we are doing our best to kill metadata, yet again. This is something I am not ready to accept. Metadata is about understanding, about context, about meaning. Companies like Google and Apple have known it for a long time, which is why they win. The rest of the world is still behind, with compliance and regulation being the most important reason why large companies implement data governance programs.

I am asking every single professional out there to fight for metadata, to explain that measuring is necessary and easy to implement, that small steps are much safer and easier to manage than a big bang, that an ecosystem of integrated tools provides greater coverage of requirements than a huge monolith, and that automation is possible.

Tomas Kratky is the CEO of MANTA and this article was originally published on his LinkedIn Pulse. Let him know what you think on manta@getmanta.com.

One Small Step for MANTA, One Big Leap for Mankind

June 30, 2017

Tomas Kratky explores his vision behind MANTA’s new capability to visualize business & logical lineage.

We just recently published a blog post announcing one new feature – MANTA now works not only with physical lineage but with business and logical lineage as well. I was shocked by the intensity of the feedback we got from our customers and partners – they were confused. MANTA has a clear vision to provide users with the most detailed, accurate, and fully automated data lineage from programming code. We do it because all data-driven organizations need it, because others are afraid to do it, and because we are smart.

New Levels of Lineage

But now we have announced business lineage and everyone has been asking what that means. Is MANTA moving towards being a more general metadata or data governance solution? NOT AT ALL! So why the business and logical lineage? Let me explain a little bit more.

MANTA offers capabilities not covered by other players, capabilities very much needed in any data-intensive environment. But MANTA is not a metadata manager or an information catalog. There are other, better-equipped vendors for that, like IBM, Informatica, Collibra, Alation, Adaptive, etc. This means that, with some exceptions, MANTA alone does not meet all the metadata-related requirements of a customer. But other metadata solutions, when selected, purchased, and deployed by a customer, also fail to meet several critical needs related to metadata accuracy and completeness, especially regarding data processing logic hidden inside programming code. This leads to an inevitable conclusion – MANTA is usually served together with other tool(s).

MANTA: Born To Integrate

Simply put, we live and die with great integrations. We have many prospects out there, since almost everyone will need us sooner or later, but to fully demonstrate our value, we need smooth integration with existing data governance / metadata solutions. We originally started with more technically oriented tools like Informatica Metadata Manager, so physical lineage was the best option. But now more and more customers have Collibra, IBM Information Governance Catalog, Alation, Data3Sixty, or Axon, and they want to see lineage there. But those solutions are not designed to capture and visualize large amounts of data processing metadata. They tend to slow down or even crash with the millions of processing steps you have in your environment.

Automate or Drown

Some vendors in this space don’t even offer automated harvesting capabilities. Some of them do, but in a limited way. So I very often see customers trying to build simple business level lineage manually. And this is where our unique features come into play. MANTA still harvests physical technical metadata from your programming code but is now also able to use existing business or logical mappings to prepare a different perspective – simplified, with easier to understand names and descriptions, but still accurate, complete, and fully automated. It allows us to easily integrate with all the not-so-technical solutions mentioned above. It means less wasted effort and fewer stressful moments for our customers and more prospects for MANTA. I see it as a win-win situation.
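To illustrate the idea (just a sketch with invented table names, not MANTA’s actual implementation), a business-level perspective can be derived from harvested column-to-column lineage by joining both ends of every edge to a business mapping and collapsing the result:

-- Column-to-column lineage edges harvested from code (illustrative structure).
CREATE TABLE physical_lineage_edge (
    src_table  VARCHAR(100),
    src_column VARCHAR(100),
    tgt_table  VARCHAR(100),
    tgt_column VARCHAR(100)
);

-- Business or logical names assigned to physical columns.
CREATE TABLE business_mapping (
    table_name    VARCHAR(100),
    column_name   VARCHAR(100),
    business_name VARCHAR(200)  -- e.g. 'Customer Risk Score'
);

-- The simplified, business-level view: term-to-term flows only.
SELECT DISTINCT bs.business_name AS source_term,
                bt.business_name AS target_term
FROM physical_lineage_edge e
JOIN business_mapping bs ON bs.table_name = e.src_table AND bs.column_name = e.src_column
JOIN business_mapping bt ON bt.table_name = e.tgt_table AND bt.column_name = e.tgt_column;

The accuracy still comes from the physical layer; the mapping only changes how it is presented.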

This article was originally published on Tomas Kratky’s LinkedIn Pulse.

The Dark Side of the Metadata & Data Lineage World

June 10, 2017

You wouldn’t believe it, but there is a dark side to the metadata & data lineage world as well. Tomas Kratky digs deep and explains how you can get into trouble. 

It has been a wonderful spring this year, hasn’t it? The first months of 2017 were hot for us. Data governance, metadata, and data lineage are everywhere. Everyone is talking about them, everyone is looking for a solution. It’s an amazing time. But there is also the other side, the dark side.

The Reality of Metadata Solutions

As we meet more and more large companies, industry experts and analysts, investors, and data professionals, we see a huge gap between their perception of reality and reality itself. What am I talking about? About the end-to-end data lineage ghost. With data being used to make decisions every single day, with regulators like FINRA, the Fed, the SEC, the FCC, and the ECB requesting reports, and with initiatives like BCBS 239 or GDPR (the new European data protection regulation), proper governance and a detailed understanding of the data environment are a must for every enterprise. And E2E (end-to-end) data lineage has become a great symbol of this need. Every metadata/data governance player on the market is talking about it and their marketing is full of wonderful promises (in the end, that is the main purpose of every marketing leaflet, isn’t it?). But what’s the reality?

The Automated Baby Beast

The truth is that E2E data lineage is a very tough beast to tame. Just imagine how many systems and data sources you have in your organization, how much data processing logic, how many ETL jobs, how many stored procedures, how many lines of programming code, how many reports, how many ad-hoc Excel sheets, etc. It is huge. Overwhelming!

If your goal is to track every single piece of data and to record every single processing step, every “hop” of data flow through your organization, you have a lot of work to do. And even if you split your big task into smaller ones and start with selected data sets (so-called “critical data elements”) one by one, it can still be so exhausting that you will never finish or even really start. And now data governance players have come in with gorgeous promises packaged in one single word – AUTOMATION.

The promise itself is quite simple to explain – their solutions will analyze all data sources and systems, every single piece of logic, extract metadata from them (so-called metadata harvesting), link it up (so-called metadata stitching), store it, and make it accessible to analysts, architects, and other users through best-in-class, award-winning user interfaces. And all of this through automation. No manual work necessary, or just a little bit. It is so tempting that you are open to it; you want to believe. And so you buy the tool. And then the fun part starts.

The Machine Built to Fail

Almost nothing works as expected. But somehow you progress with the help of hired (and usually overpriced) experienced consultants. Databases (tables, columns) are there, your nice graphically created ETL jobs are there, your first simple reports too, but hey! There is something missing! Why? Simply because you used a nasty complex SQL statement in your beautiful Cognos report. And you used another one when you were not satisfied with the performance of one Informatica PowerCenter job. And hey! Here the lineage is completely broken! Why is THAT? Hmmm, it seems that you decided to write some logic inside stored procedures instead of drawing a terrifying ETL workflow, simply because it was so much easier with all those advanced Oracle features. OK, I believe you’ve got it. Different kinds of SQL code (and not just SQL but also Java, C, Python, and many others) are everywhere in your BI environment. Usually, there are millions and millions of lines of code everywhere. And unfortunately (at least for all metadata vendors) programming code is super tough to analyze and extract the necessary metadata from. But without it, there is no E2E data lineage.
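To make the problem tangible, here is the kind of statement (entirely made up) that typically hides behind a report’s SQL override or a stored procedure: every output column carries real transformation logic, and none of it is visible to a tool that only reads the graphical job definitions.

-- Illustrative only: business logic buried in a hand-written SQL override.
SELECT c.customer_id,
       c.segment_code,
       SUM(CASE WHEN t.txn_type = 'FEE' THEN -t.amount ELSE t.amount END) AS net_amount,
       CASE
           WHEN SUM(t.amount) > 100000 AND c.risk_score < 40 THEN 'PREMIUM'
           WHEN c.risk_score >= 70                           THEN 'WATCHLIST'
           ELSE 'STANDARD'
       END AS customer_band  -- a business rule that exists only in this query
FROM dw.customer c
JOIN dw.txn_fact t
  ON t.customer_id = c.customer_id
 AND t.txn_date >= ADD_MONTHS(CURRENT_DATE, -12)  -- date syntax varies by database
GROUP BY c.customer_id, c.segment_code, c.risk_score;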

At this moment, marketing leaflets hit the wall of reality. As of today, we have met a lot of enterprises but only very few solutions capable of automated metadata extraction from SQL programming code. So what do most big vendors (or the big system integrators implementing their solutions) usually do in this situation? Simply finish the rest of the work manually. Yes, you heard me! No automation anymore. Just good old manual labor. But you know what – it can be quite expensive. For example, a year ago we helped one of our customers reduce the time needed to “finish” their metadata project from four months to just one week! They were ready to invest the time of five smart guys, four months per person, to manually analyze hundreds and hundreds of BTEQ scripts, extract metadata from them, and store it in the existing metadata tool. In the United States, we typically meet clients with several hundred thousand database scripts and stored procedures. That’s sooo many! Who is going to pay for that? The vendor? The system integrator? No, you know the answer. In most cases, the customer is the one who has to pay for it.

Know Your Limits

I have been traveling a lot over the last few weeks and have met a lot of people, mostly investors and industry analysts, but also a few professionals. And I was amazed by how little they know about the real capabilities and limitations of existing solutions. Don’t get me wrong, I think those big guys do a great job. You can’t imagine how hard it is to provide a really easy-to-use metadata or data governance solution. There are so many different stakeholders, needs, and requirements. I admire those big ones. But it should not mean that we close our eyes and pretend that those solutions have no limitations. They have limitations, and fortunately the big guys, or at least some of them, have finally realized that it is much better to provide an open API and to allow third parties like MANTA to integrate and fill the existing gaps. I love the way IBM and Collibra have opened their platforms and I feel that others will soon follow.

How can you protect yourself as a customer? Simply conduct proper testing before you buy. Programming code in BI is ignored so often, maybe because it is very low-level and typically not the main topic of discussion among C-level guys. (But there are exceptions – just recently I met a wonderful CDO of a huge US bank who knows everything about the scripts and code they have inside their BI. It was so enlightening, after quite a long time.) It is also very hard to choose a reasonable subset of the data environment for testing. But you must do it properly if you want to be sure about the software you are going to buy. With proper testing, you will realize much sooner that there are some limitations, and you will start looking for a solution to them up-front, not in the middle of your project, already behind schedule and with a C-level guy breathing down your neck.

It is time to admit that marketing leaflets lie in most cases (oh, sorry, they just add a little bit of color to the truth), and you must properly test every piece of software you want to buy. Your life will be much easier and success stories of nicely implemented metadata projects won’t be so scarce.

Originally published on Tomas Kratky’s LinkedIn profile.

MANTA 2016: A New Home, New Challenges and a New Name

December 30, 2016

Only a few days are left. 2017 is knocking on our doors harder and harder. We, at MANTA, had a wonderful year, a really amazing one.

A New Home

So many things have happened. Let’s start at the end. Just a few weeks ago we finally found a new office, a wonderful place on I.P. Pavlova square, very close to the historic center of Prague. It was so hard; it took more than six months to find something really appropriate. But this place is cool – under the roof, with a lot of space and light and energy. It will become our new home, and we are very happy about it. It is our Christmas gift. The best place for the best team.

New Challenges

Data governance, metadata, and especially data lineage became very hot in 2016. There are many reasons why that happened, but we are excited to see how companies both in the US and Europe have started their governance programs. Also, vendors have become more active in this space and a lot has changed on the metadata market map. But what has remained the same is the need to harvest metadata in an automated way from many different kinds of systems, both old and new, most of the time implemented in the form of programming code.

And while this is a weak spot for all well-known metadata players, MANTA shines when it comes to code. So, that has led us to more and more partnerships with various technology vendors. We are well known for our integration with IMM, but as of this year we are also partners with two other market leaders – IBM and Teradata. We also support Microsoft SQL Server, and a few more additions to our technology stack are planned for the first months of 2017.

New Business and a New Name

But there are also other things that have made us happy. We have continued on our US journey and signed on several new customers there, some of them really huge like one of the TOP4 US banks and one of the TOP5 mutual funds in the world. It is not a surprise that we have been growing super fast. We tripled our revenue in 2016 and our pipeline has simply exploded.

This leads us to another major change – we have decided to drop the “Tools” from our name and become just MANTA. Everybody was calling us that anyway, so why resist? We’ll tell you more about it in the first week of 2017.

But the best part was getting feedback from our customers. They like what we do and how we do it. And that has been the greatest message for the whole MANTA team. It keeps us moving forward and working hard to exceed their highest expectations.

Thank you for everything in 2016 and see you in 2017!

Metadata as Explained to My Grandma

September 25, 2016

Our CEO Tomas Kratky explains to his grandma what we do: “Metadata management is simply cartography.” 

A few weeks ago, I talked to my grandma. I travel a lot and she lives quite far away, so we hadn’t seen each other for a long time. She is an old lady, very kind and also very curious about what I do and how I live. She had many questions (half of them about kids, which I don’t have yet), but everything went well till she asked about Manta Tools, the company I run. “Tomas”, she said, “tell me more about what you do for a living. You do computers, don’t you?”

I am used to talking about my vision, about what we do as a company and why, with so many people – our team, our clients, our partners, our investors. But none of them are like my grandma. The most advanced thing she can imagine is how to turn on her television. She doesn’t care about big data or any other similar buzzwords, so I was absolutely unable to say even a single word. When I left her room, I knew that I had failed badly.

I spent several weeks thinking about the best way to describe the world I live in, and this is the result. I haven’t used it yet; I wanted to share it with you, other professionals. I hope you’ll find it useful and helpful. And I would be glad to hear your feedback to further improve my story. In the end, I want to make my grandma proud of me.

You Need Information…

There are many huge companies like banks, telcos and utilities around us. Let’s imagine that one huge company is a country, like the United States, for example. Such a company provides a lot of services to its customers and runs a lot of processes. To do that efficiently, they need a lot of IT systems like online banking and accounting. And there are also many other systems which are owned by other companies but are somehow visible and even accessible to our example company, like weather forecasts and social networks. All those systems are valuable sources of information – about clients, about partners, about the environment, about our own performance, etc. Information is our gold today and it always has been. You are stronger when empowered with the right information delivered at the right time. So if information is gold, those systems I’ve just mentioned are gold mines.

…To Make Decisions

The main purpose of information is to use it to make decisions. After all, we live in a data-driven world. Let’s pretend that decision makers are dukes, lords, and kings living in that country. They own the gold, for sure, but to use it, we must dig it up and transport it to them. And you know what? This is exactly what business intelligence is about, isn’t it? The mining is data analytics in the real world; the trucks and roads are ETLs (or data processing).

So, we have a great country with gold everywhere and lords eager to get it. And we have a way to find, dig up and deliver the gold to them. But, just as in the real world, our country is large with deep valleys, fierce rivers, steep mountains, dark woods, hot deserts, and insidious swamps. Delivering the gold from the mines to the lords is sometimes a very dangerous job, as you can imagine. We can mistake rocks for gold, or we can damage it (this is why everybody runs data quality programs). But it is also dangerous for us – we can easily get lost if we choose to follow the wrong road. And so, we need maps.

[Image: MANTA treasure map]

They allow us to share with others the knowledge of one brave Indiana Jones, the very first discoverer of that road, about the right direction and all associated dangers. Those maps are metadata, ladies and gentlemen, and metadata management as a discipline is simply cartography.

Lords always want more gold. They are born with this obsession. So they continuously invest in better and larger mines, wider and safer roads, and faster and bigger wagons to deliver as much gold as possible in the safest way possible. We have great highways now (yes, I mean those wonderful graphical ETL tools like Informatica PowerCenter, IBM DataStage, and Talend) which are very easy to discover and map for any cartographer, regardless of their level of experience.

But with all those rivers, mountains, swamps, and deserts it is simply not smart or even possible to build highways everywhere. Sometimes we just need a way to quickly walk through a dark forest, and building an expensive highway is not a good choice if we don’t understand what’s on the other side (yes, I am talking about data science!). Sometimes we need to build a bridge across a river or a road through a dangerous swamp. So you need special construction to do that (now I am talking about programming code, like SQL for example, which allows us to overcome almost any engineering difficulty). Unfortunately, mapping complex bridges, desert roads, and special roads through swamps is a very tricky job for any cartographer.

The Exponential Growth

What is even worse, in the last few years the number of mines, roads, and wagons of all kinds has grown exponentially. We build them everywhere and usually there is not enough time to do it according to a bigger plan. It is a huge mess today. Detailed, accurate maps are needed more than ever before. We can get really lost without them. All our gold can be lost forever, in just a second.

Fortunately, there are several great and popular companies doing cartography – Informatica, IBM, Adaptive, Collibra, and Alation, to name a few. But they focus more on how their maps look, how easily they can be used, and how understandable they are for drivers. Unfortunately, they don’t map more dangerous roads (programming code in the form of various SQL scripts or stored procedures) correctly. So the maps they produce are rather incomplete with a lot of “here be dragons” areas, blank spots. And this is exactly where Manta Tools comes in.

We are the Indiana Joneses, the adventurers of modern cartography. We discover dangerous roads where others are afraid to go. We map them and help the bigger cartographers produce as complete and precise maps as possible.

With Manta Tools, there are no “here be dragons” areas anymore.

And all the precious and shiny gold of our lords is much safer with us!

How the API Economy Changes the B2B World

July 31, 2016

Have you heard about the API economy? I’ll bet your answer is yes. It is an even buzzier buzzword than #BigData. (OK, not really, but it’s almost there!)

Almost every company is now building its public API, trying to open up a little bit to allow its users to access existing services in a programmatic way, not just through the GUI. Hundreds of mashups (combinations of services) pop up every day. Thousands of new applications (mainly mobile) exist only due to this phenomenon. But is this really something so new?

Not at all. We’ve had all these things for so many years. B2B has been about programmatic integration from the very beginning. The core principles (technology, algorithms) are still the same, but the market size/opportunity is what differs now. Just think about mobile phones – this great and simultaneously sometimes kinda lame (for all those unhappy users of non-Apple phones) piece of equipment is so widespread that, together with widely accessible Internet, it defines a completely new market.

Now it finally makes sense to create public application services because there is such a big chance someone will use them (and maybe even pay for them – that’s why we call it “an economy”). And with a growing number of consumers, the group of producers is also growing. In the end, it really leads to better user experience and satisfaction in many cases. As with all other buzzwords, there is something to it.

What’s up with the API Economy?

I (and everybody else) like the idea of openness and smooth integration. And with a growing number of application services published by highly regulated and conservative companies like banks, I am becoming more and more disappointed with the world of traditional data management tools. As you know, our product Manta Flow is a great solution when you need to understand what is inside your SQL-like code. We are highly specialized in custom code in BI environments, and as such we mostly find ourselves in situations where our customer is interested in extracting valuable metadata from their BI programs and applications and loading it into their existing metadata manager. Why?

Simply because most tools ignore those complex parts of the BI environment like stored procedures, scripts and other kinds of code. This means that the data lineage is not complete, it’s not easy to do impact analyses, and you waste a lot of money on a tool which gives you so little. But to be able to integrate, we need to find a way to load metadata extracted by Manta Flow into an existing tool and combine it with the metadata already stored inside. And now, ladies and gentlemen, the real fun begins.

It’s Not Easy to Be API-Ready

Usually every vendor has some kind of API to support integration. A nice group of well designed services exists with amazing documentation. So you try it. And you fail. So you try it a different way. And you break something. And because you have to come up with a solution to satisfy a customer, you continue – and it hurts. Believe me, we go through this painful process every damn day. How is it possible? Simply because those big monolithic tools were not designed to support integration with other software applications. I understand! To design and develop a big and really pluggable software system is not an easy task, even for experienced engineers. But so many things are broken! Some tools try to analyze SQL code but with bad results. Surprisingly, there is no easy, standardized way to replace wrong metadata without breaking links to other parts of the repository.

[Image: the monolithic player in the API economy]

Some tools internally support many different kinds of metadata structures but expose only a very limited part of their capabilities through their API. That means you are forced to work directly with the existing structures in the repository! Some tools have serious performance issues when loading external metadata through their API. There is also one data management product infamously known for its strict legal protection against any integration. Trying to integrate with that tool is like trying to open the doors of a dark dungeon. Such a project can be so painful that the customer decides to stop the integration entirely. Does that sound like the 21st century to you? Not to me.

New Hope

But this cloud has a silver lining, as several traditional vendors have announced massive redesigns. Also, younger and more focused players, like business glossaries, are better designed for integration from the very beginning. (Some great examples are Collibra and Diaku Axon, but there are many more of them.) We implemented several integration scenarios requested by our clients with some of these young stars, and you know what – everything went surprisingly well. It seems that finally, after so many years, the API approach is also entering our beloved data management world.

And that is great not just for other-than-huge vendors like us, but especially for customers. Metadata is a very complex beast to tame and no product can become the ultimate solution for it. Great API support allows you to combine several great tools to get a more complete solution. And that’s what matters the most.

Long live the API economy!

Any questions or comments? Just let Tomas know at manta@mantatools.com.

Agile BI Development in 2016: Where Are We?

Agile development was meant to be the cure for everything. It’s 2016 and Tomas Kratky asks the question: where are we?

BI departments everywhere are under pressure to deliver high quality results and deliver them fast. At the same time, the typical BI environment is becoming more and more complex. Today we use many new technologies, not just standard relational databases with SQL interfaces, but for example NoSQL databases, Hadoop, and also languages like Python or Java for data manipulation.

Another issue we have is a false perception of the work that needs to be done when a business user requests some data. Most business users think that preparing the data is only a tiny part of the work and that the majority of the work is about analyzing the data and later communicating the results. Actually, it’s more like this:

[Chart: the real split of BI work – analysis and communication are the tiny part at the top; data preparation is the rest]

See? The reality is completely different. The communication and analysis of data is that tiny part at the top and the majority of the work is about data preparation. Being a BI guy is simply a tough job these days.

This whole situation has led to an ugly result – businesses are not happy with their data warehouses. We all have probably heard a lot of complaints about DWHs being costly, slow, rigid, or inflexible. But the reality is that DWHs are large critical systems, and there are many, many different stakeholders and requirements which change from day to day. In another similar field, application software development, we had the same issues with delivery, and in those cases, agile processes were good solutions. So our goal is to be inspired and learn how agile can be used in BI.

The Answer: Agile?

One very important note – agility is a really broad term, and today I am only going to speak about agile software development, which means two things from the perspective of a BI development team:

1. How to deliver new features and meet new requirements much faster

2. How to quickly change the direction of development

Could the right answer be agile development? It might be. Everything written in the Agile Manifesto makes sense, but what’s missing are implementation guidelines. And so this Manifesto was, a little bit later, enriched with so-called agile principles. As agile became very popular, we started to believe that agile was a cure for everything. This is a survey from 2009 which clearly demonstrates how popular agile was:

[Chart: adoption of agile methodologies among development teams, 2009 survey]

Source: Forrester/Dr. Dobb’s Global Developer Technographic, 2009

And it also shows a few of the many existing agile methodologies. According to some surveys from 2015, agile is currently being used by more than 80% or even 90% of development teams.

Semantic Gap

Later on, we realized that agile is not an ultimate cure. Tom Gilb, in his famous article “Value-Driven Development Principles and Values” written in 2010, went a bit deeper. After conducting a thorough study of the failures, mistakes, and also successes since the very beginning of the software industry, he made one thing clear – there is something called a semantic gap between business users and engineers, and this gap causes a lot of trouble. Tom Gilb hit the nail on the head by saying one important thing: “Rapidly iterating in wrong directions is not progress.” Therefore, the requirements need to be treated very carefully as well.

But even with the semantic gap issue, agile can still be very useful. Over the last ten years the agile community has come up with several agile practices. They are simple-to-explain things that anyone can start doing to improve his or her software process. And this is something you should definitely pay attention to. Here you can see agile practices sorted by popularity:

[Chart: agile practices sorted by popularity]

If you have ever heard about agile, these probably come as no surprise to you. The typical mistake made by many early adopters of agile was simply being too rigid; I would call it “fanatic”. It was everything or nothing. But things do not work that way.

It’s Your Fault If You Fail

Each and every practice should be considered a recommendation, not a rule. Your responsibility is to decide if it works for you or not. Each company and each team are different, and if the system metaphor practice has no value for your team, just ignore it like we do. Are you unable to get constant feedback from business users? OK, then. Just do your best to get as much feedback as you need.

On the other hand, we’ve been doing agile for a long time, and we’ve learned that some practices (marked in red) are more important than others and significantly influence our ability to be really fast and flexible.

[Chart: agile practices with the most important ones marked in red]

There are basically two groups of practices. The first group is about responsibility. A product owner is someone on your side who is able to make decisions about requirements and user needs, prioritize them, evaluate them, and verify them. It can be someone from the business group, but this job is very time-consuming, so more often the product owner will be the person on your BI team who knows the most about the business. Without such a person, your ability to make quick decisions will be very limited. Making a burndown list is a very simple practice which forces you to clearly define priorities and to select the features and tasks with the highest priority for the next release. And because your releases tend to be more frequent with agile, you can always pick only a very limited number of tasks, which makes clear priorities vital.

The second group of critical practices is about automation. If your iterations are short, if you integrate the work of all team members on a daily basis and also want to test it to detect errors and correct them as early as possible, and if you need to deliver often, you will find yourself and your team in a big hurry without enough time to handle everything manually. So automation is your best friend. Your goal is to analyze everything you do and replace all manual, time-consuming activities with automated alternatives.

What Tools To Use?

Typical tools you can use include:

1. Modern Version Control Systems

A typical use case involves Git, SVN, or Team Foundation Server storing all the pieces of your code, tracking versions/changes, merging different branches of code, etc. What you are not allowed to do is use shared file systems for that. Unfortunately, it is still quite a common practice among BI folks. Also, be careful about using BI tools which do not support easy, standard versioning. Do not forget that even if you draw pictures, models, or workflows and do not write any SQL, you are still coding.

So a good BI tool stores every piece of information in text-based files – for example XMLs. That means you can make them part of a code base managed by Git, for example. A bad BI tool stores everything in binary and proprietary files, which can’t be managed effectively by any versioning system. Some tools support a kind of internal versioning, but those are still a big pain for you as a developer and they lead to fragmented version control.

2. Continuous Integration Tools

You’ll also need tools like Maven and Jenkins or PowerShell and TeamCity to do rapid and automated build and deploy of your BI packages.

3. Tools for Automated Code Analysis and Testing

I recommend using frameworks like DbFit at least to write automated functional tests, and also using a tool for static code analysis to enforce your company standards, best practices, and code conventions (Manta Checker is really good at that). And do not forget – you can’t refactor your code very often without proper testing automation.

4. Smart Documentation Tools

In the end, you can’t work on the parts of your system you do not understand. The best combination of tools you can get is something like a wiki to capture basic design ideas and a smart documentation tool able to generate detailed documentation in an automated way when needed. Today there are many very good IDEs that are able to generate mainly control-flow and dependency diagrams. But we are BI guys, and there is one thing that is extremely useful for us – it is called data lineage, or you can call it data flow.

Simply put, it’s a diagram showing you how data flows and is transformed in your DWH. You need data lineage to perform impact analyses and what-if analyses as well as to refactor your code and existing data structures. There are almost no solutions on the market which are able to show you data lineage from your custom code (except our Manta Flow, of course).
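A tiny, invented example of what such a diagram has to capture: the amount below passes through three layers of hand-written SQL, and an impact or what-if analysis of changing its calculation has to trace all of them.

-- Layer 1: staging-to-warehouse load, written by hand in a script.
INSERT INTO dwh_orders (order_id, order_date, amount_usd)
SELECT order_id, order_date, amount * fx_rate
FROM stg_orders;

-- Layer 2: a derived data mart table (CTAS syntax varies by database).
CREATE TABLE mart_daily_revenue AS
SELECT order_date, SUM(amount_usd) AS revenue_usd
FROM dwh_orders
GROUP BY order_date;

-- Layer 3: the view the report actually reads.
CREATE VIEW v_revenue_report AS
SELECT order_date, revenue_usd
FROM mart_daily_revenue;

-- The lineage question: if the amount_usd formula in layer 1 changes,
-- which downstream tables, views, and reports are affected?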

And that’s it. Of course, there are some other more advanced practices to support your agility, but this basic stuff is, I believe, something which can be implemented quickly from the perspective of both processes and tools. I definitely suggest starting with a smaller more experienced team, implementing the most important practices, playing around a little bit, and measuring the results of different approaches. I guarantee that you and your team will experience significant improvements in speed and flexibility very soon.

Do you have any questions or comments? Send them directly to Tomas Kratky at manta@mantatools.com! Also, do not forget to follow us on Twitter and LinkedIn.

Big Data? There Is No Silver Bullet

“Big Data is not a system; it is simply a way to say that you have a lot of data.” Tomas Kratky explores why we shouldn’t ditch the data warehouse because there is, in fact, no successor yet. 

Just a few days ago LinkedIn notified me about one very interesting research study “Is the Data Warehouse Dead?” (done by The Information Difference and sponsored by IBM in January 2015). Although this study is more than a year old, it is, I believe, still valid. The focus of the study was simply the “data warehouse” itself and its future existence in the new world of Big Data technologies. I am happy that someone has started to ask more challenging questions in a world where the buzzwords Big Data are a cure for everything. I definitely suggest that everyone read the study, but I still have a few serious issues with it.

Not a System, Just Data

The whole idea of asking if the DWH is going to be replaced by Big Data is stupid (please, pardon my language). The DWH is a critical system which is responsible for gathering and consolidating data and providing the information needed to run a business. Big Data, by contrast, is, as defined in the study, “the term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.” Do you see the problem? Big Data is not a system; it is simply a way to say that you have a lot of data. It does not make sense to ask if Big Data is going to replace DWHs. The right question to ask is whether Big Data technologies are going to penetrate our existing DWH world, and how.

Everybody’s Happy, Right?

Two questions in the study were aimed at the costs of DWH maintenance and user satisfaction. The result was that only 27% of companies are unhappy about the costs. WTF? And less than 10% of respondents find their DWH unsatisfactory (from the usage perspective). Really? Every day I hear companies complaining about DWH maintenance costs being so high, changes being so slow, DWH teams being so rigid and formal, data quality being so bad, and I could go on and on. So what is the issue?

Just take a look at the types of respondents surveyed. No surprise that around 80% of them are IT guys. IT is usually responsible for DWH design, development, and maintenance. If someone asks you about a house you built with your own hands and which is your responsibility, will you tell him that the house is bad, expensive, and hard to live in? I would say not; your response will typically be “yeah man, it’s a great house” or “it’s not bad” at worst. Dear analysts, go ask CFOs and business users and your survey will look completely different.

Facts First

But despite those things I don’t like about the study, the parts with actual facts are really useful. For example – most companies have more than two DWHs (70%) with less than 50 terabytes of data (53%) and need at least 7 people just to maintain the DWH (up to hundreds). It is also obvious that most companies are still playing with Big Data technologies without any need or proper plan.

I’m suspicious about all those consultants, analysts, and “industry experts” talking about the importance of Big Data. All of them are showing us examples of sensors. But you know what? Most companies do not have any sensors or, to be honest, do not need any sensors. And data coming from web logs and mobile apps? That’s nothing special.

Another argument for Big Data is that it allows you to store everything that you will need in the future. Okay, do you really believe that a company can store all its data on Hadoop? Of course not! You can’t store everything. It is not even possible to say what “everything” is. You must choose what to store. And we are still not able to foresee the future, so we simply select those data sets which we believe could be most useful later on.

Big Data Is Not for Everything (Or for Everyone)

And this is exactly why almost all companies are still struggling with Big Data – trying to figure out how to use it effectively without destroying their existing infrastructure. And this is also a reason why the study revealed no success stories and no really big data out there managed by Big Data technologies. And you know what the saddest thing is about the whole study? Question number 8 asked respondents if they view Big Data as being important for their organization. And you know what? 64% consider Big Data to be important or even very important. That’s actually not surprising. Most of the respondents were IT guys, and all of us engineers love to play with new toys. But why is Big Data important for their business? And how exactly do they plan to use those technologies?

I see the DWH as a complex system where multiple technologies are needed. And Big Data technologies are a nice evolutionary step. Relational databases, NoSQL databases, Hadoop, Spark, and other components have their value when used properly. My experience is that Big Data technologies are very efficient for storing and processing unstructured data. We can also use Big Data technologies very effectively to build staging areas, to replace old ways of doing ETLs, and to reduce licensing costs significantly (one of our clients recently shared with me that more than 50% of their Teradata processing capacity is used for ETLs).

Futurum Immaturis

This is exactly where all development is heading right now. We have already met a significant number of customers redesigning their DWH in this manner. And most of the respondents in the study feel the same way. But we must be very careful with the governance of these new, modernized DWHs. Everything is so young, so immature. When rebuilding your staging area, you can’t lose your ability to gather and use metadata, to work with data lineage, and to do impact analyses; you can’t exclude it from data quality processes; you can’t put everything you’ve built over the last few years in danger. If you screw it up, your maintenance costs will simply blow up.

People have their limitations. Solving complex problems is a hard job for most of us. There’s nothing bad about it; our brain has its boundaries, and DWH design is a really complex task. Things change every day, every minute, and must be incorporated to continue supporting business needs. It is no surprise that in most organizations, the DWH has serious problems after several years. The incoming Big Data hype might push those who are less experienced to rebuild the whole DWH with those shiny new Big Data technologies. Not surprisingly, it will only lead to a bigger mess with two overlapping systems, no data governance, and maintenance costs that are through the roof.

One of the respondents described it nicely: “I suspect that ‘Big Data’ is a way to THINK that you are obtaining good data while avoiding the hard work of understanding and designing data models.” Turing Award winner Fred Brooks once said: “There is no silver bullet.” And, in fact, there is none.

INFOGRAPHIC: Custom Code in Your BI

September 17, 2015

Custom code is in every BI environment. But where exactly?

Over the past two years, we’ve spent a tremendous amount of time explaining the existence and risks of the hand-coded parts of business intelligence environments. We’ve covered pretty much everything – labor savings, easy impact analyses, fast automatic documentation, and all of the others.

But there was something missing – a fine graphical representation. We made one for you. Yeah, I know, every representation is somehow simplified. But for us it is still useful as a blueprint for discussions with our customers and prospects. So, where can you usually find custom code in BI?

[Infographic: where custom code hides in a BI environment]

1) Data Consolidation. The first table and first arrow represent the data consolidation process. The goal here is to drag data from all available systems, transform it, and ensure a certain level of data quality. This part is mostly done by hand-coded tooling (a mix of a simple scheduler and a lot of SQL scripts, typically stored procedures that are sometimes partially generated – a mix quite common in smaller organizations) or by ETL tools (mostly in bigger enterprises). The problem is that using ETL tools does not ensure the absence of messy custom code (think about, for example, more complex business logic which is hard to implement using ETL tools, or the SQL overrides so often used in Informatica PowerCenter and other similar tools) – it is present in every environment there is. A sketch of this kind of hand-coded step appears after this list.

2) Traditional DWH Area. The fourth table represents a total summary of the “traditional DWH area” – usually meaning an operational (sometimes even real-time) data store, an enterprise data warehouse, and other data marts and databases around it. Simple tasks in these environments are usually solved by foolproof vendor tools, but when the system hits a certain level of complexity, custom code comes back into play – especially in connection with the preparation of data for business-oriented data marts (marketing, controlling, etc.).

3) Reporting. This kind of app uses data from the previous step, usually very well prepared in various data marts. But we do not live in an era of predefined reports and dashboards anymore. Today everyone wants to play with data – simply put, self-service BI is here. What does that mean? The number of reports and analyses is growing rapidly, and sometimes the data prepared in our data mart is simply not enough to get our numbers. So we need to transform the data a little bit more, and voila! We typically use the most widespread transformation and data preparation language – SQL (or something very similar). Today the reports themselves are prepared using nice-looking UIs and wizards, but for various (usually performance) reasons, there are often a lot of SQL overrides present.

4) Analytics. And analytics… this is a completely new and “full-of-custom-code” story. Over the last year the Big Data and Data Science boom has changed everything we know about data management. And it is just the beginning! Enterprises are slowly drowning in data swamps (sorry, data lakes). PhDs and MScs are playing with data, creating smart algorithms for predictions or segmentations, and literally everything is hand-coded using a variety of languages, starting with R or Python and ending with good old SQL (or its Hadoop variations like HiveQL or Pig).
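For the data consolidation step in item 1 above, here is a minimal, invented sketch of the kind of hand-coded logic that runs from a simple scheduler and therefore never appears in any ETL tool’s metadata (MERGE syntax shown in Oracle flavor):

-- Illustrative only: a hand-written consolidation step.
MERGE INTO cons_customer tgt
USING (
    SELECT src.customer_id,
           UPPER(TRIM(src.customer_name))             AS customer_name,  -- ad-hoc cleansing rule
           COALESCE(src.email, 'unknown@example.com') AS email
    FROM   src_crm_customer src
) s
ON (tgt.customer_id = s.customer_id)
WHEN MATCHED THEN UPDATE SET
     tgt.customer_name = s.customer_name,
     tgt.email         = s.email
WHEN NOT MATCHED THEN INSERT (customer_id, customer_name, email)
     VALUES (s.customer_id, s.customer_name, s.email);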

There is no BI without chaotic and messy custom code. Gaining a full understanding of those hand-coded scripts is necessary not only because you need to know what is happening inside, but also to be sure it’s all working like a Swiss timepiece.

Any thoughts or comments? Please let us know via email at manta@mantatools.com. Also, do not forget to follow us on Twitter!

 
