Tomáš Krátký

Is Guessing Good Enough for Your GDPR Project?

June 28, 2018

I will tell you one thing — I am tired of the GDPR buzz. Don’t get me wrong, I value privacy and data protection very much, but I hate the way almost every vendor uses it to sell their goods and services, so much so that the original idea is almost lost.

It is similar to BCBS and other past data-oriented regulations. Consulting companies, legal firms, data governance/security/metadata vendors — we are all the same: buy our “thing” and you will be OK, or at least safer with us! Every second book out there tells us that every change is an opportunity to improve, to evolve. So, what is the improvement here with GDPR? If I look around, I see a lot of legal work being done, adding tons of words (in very small letters, as always) to already long Terms & Conditions. And you know what? I don’t think there is any real improvement in it.

But things are not always so bad. There is also a lot of good stuff going on with one important goal—to better understand and govern data and its lifecycle in a company. And there is one challenging but critical part I want to discuss today—Data Lineage. That means how data is moved around in your organization. You must understand that a customer’s email address or their credit card number is not just in your CRM but is spread all over your company in tens or even hundreds of systems — your ERP, data warehouse, reporting, new data lake with analytics, customer portal, numerous Excel sheets and even external systems. The path of the data you collect can be very complex, and if you think about all possible ways you can move and transform data in your company, one thing should be clear — your data lineage has to be automated as much as possible.

Different Approaches to Data Lineage

That being said, I feel it is important to talk about different approaches to data lineage that are used by data governance vendors today. Because when you talk about metadata, you very often think about simple things — tables, columns, reports. But data lineage is more about logic — programming code in any form. It can be an SQL script, PL/SQL stored procedure, Java program or complex macro in your Excel sheet. It can literally be anything that somehow moves your data from one place to another, transforms it, modifies it. So, what are your options for understanding that logic?

This article is based on a presentation by Jan Ulrych at the DGIQ 2018 Conference.

Option 1: Ignore it! (aka data similarity lineage)

No, I am not crazy! There are products building lineage information without actually touching your code. They read metadata about tables, columns, reports, etc. They profile data in your tables too. And then they use all that information to create lineage based on similarities. Tables, columns with similar names and columns with very similar data values are examples of such similarities. And if you find a lot of them between two columns, you link them together in your data lineage diagram. And to make it even more cool, vendors usually call it AI (another buzzword I really hate). There is one great thing about this approach — if you watch data only, and not algorithms, you do not worry about technologies and it is no big deal if the customer uses Teradata, Oracle or MongoDB with Java on top of it. But on the other hand, this approach is not very accurate, performance impact can be significant (you work with data) and data privacy is at risk (you work with data). There are also a lot of details missing (like transformation logic for example, which is very often requested by customers) and lineage is limited to the database world, ignoring the application part of your environment.
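To make the similarity idea concrete, here is a minimal sketch of how such a linker might score a candidate link between two columns, combining fuzzy name matching with the overlap of profiled sample values. The function and column names are hypothetical; this is not any vendor’s actual algorithm:

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Fuzzy similarity of two column names, in the range 0.0-1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def value_overlap(values_a, values_b) -> float:
    """Jaccard overlap of profiled sample values."""
    sa, sb = set(values_a), set(values_b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

def guess_link(col_a, col_b, threshold=0.7):
    """Link two columns when the combined score passes a threshold.

    The weighting and threshold are arbitrary here; tuning them is
    exactly the guesswork this approach depends on.
    """
    score = (0.5 * name_similarity(col_a["name"], col_b["name"])
             + 0.5 * value_overlap(col_a["sample"], col_b["sample"]))
    return score >= threshold

crm_email = {"name": "cust_email",
             "sample": ["a@x.com", "b@y.com", "c@z.com"]}
dwh_email = {"name": "customer_email",
             "sample": ["a@x.com", "b@y.com", "c@z.com", "d@q.com"]}

print(guess_link(crm_email, dwh_email))  # True: similar names, mostly shared values
```

Note what is missing: the SQL or ETL logic that actually copied the column is never consulted, which is why false positives (two unrelated `status` columns, say) and false negatives (a renamed, heavily transformed column) are unavoidable.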

Option 2: Do the “business” lineage manually

This approach usually starts from the top by mapping and documenting the knowledge in people’s heads. Talking to application owners, data stewards and data integration specialists should give you fair but often contradictory information about the movement of data in your organization. And if you miss talking to someone you simply don’t know about, a piece of the flow is missing! This often results in the dangerous situation where you have lineage but are unable to use it for real case scenarios — not only can you not trust your data, you cannot trust the lineage either.

Jan Ulrych presenting at DGIQ 2018.

Option 3: Do the technical lineage manually

I will get straight to the point here — trying to analyze technical flows manually is simply destined to fail. With the volume of code you have, its complexity, and the rate of change, there is no way to keep up. When you start considering the complexity of the code, and especially the need to reverse engineer the existing code, this becomes extremely time-consuming. Sooner or later, such manually managed lineage will fall out of sync with the actual data transfers within the environment, and you will end up with the feeling that you have lineage you cannot actually trust.
Now that we know that automation is key, let’s take a look at some less labor intensive and error prone approaches.

Option 4: Trace it! (aka data tagging lineage)

Do you know the story of Theseus and the Minotaur? The Minotaur lives in a labyrinth, and Ariadne, who knows the labyrinth well, gives Theseus a ball of thread to help him navigate it by being able to retrace his path.

And this approach is a bit similar. The whole idea is that each piece of data that is being moved or transformed is tagged/labeled by a transformation engine which then tracks that label the whole way from start to finish. It is like Theseus. This approach looks great, but it only works well as long as the transformation engine controls the data’s every movement. A good example is a controlled environment like Cloudera. If anything happens outside its walls, the lineage is broken. It is also important to realize that the lineage is only there if the transformation logic is executed. But think about all the exceptions and rules that apply only once every couple of years. You will not see them in your lineage till they are executed. That is not exactly healthy for your data governance, especially if some of those pieces are critical to your organization.
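The thread-of-Ariadne idea can be sketched in a few lines of runnable pseudocode. This is purely illustrative; the class and method names are invented, not any real engine’s API:

```python
import uuid

class TaggingEngine:
    """Toy transformation engine that tags every record it touches."""

    def __init__(self):
        self.trace = {}          # tag -> list of steps the record passed through

    def ingest(self, record, source):
        tag = str(uuid.uuid4())  # the ball of thread: one tag per record
        self.trace[tag] = [source]
        return {"tag": tag, "data": record}

    def transform(self, tagged, step_name, fn):
        self.trace[tagged["tag"]].append(step_name)  # extend the thread
        return {"tag": tagged["tag"], "data": fn(tagged["data"])}

engine = TaggingEngine()
rec = engine.ingest({"email": "A@X.COM"}, source="crm.customers")
rec = engine.transform(rec, "normalize_email",
                       lambda d: {"email": d["email"].lower()})
rec = engine.transform(rec, "load_dwh", lambda d: d)

print(engine.trace[rec["tag"]])
# ['crm.customers', 'normalize_email', 'load_dwh']
```

The weakness is visible right in the sketch: a step performed outside `engine` (an analyst’s Excel macro, a one-off script) never calls `transform`, so it leaves no thread behind, and logic that is never executed produces no trace at all.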

Option 5: Control it! (aka self-lineage)

The whole idea here is that you have an all-in-one environment that gives you everything you need—you can define your logic there, track lineage, manage master data and metadata easily, etc. There are several tools like this, especially with the new big data/data lake hype. If you have a software product of this kind, everything happens under its control — every data movement, every change in data. And so it is easy for such a tool to track lineage. But here you have the very same issue as in the previous case with data tagging. Everything that happens outside the controlled environment is invisible, which matters especially when you consider long-term manageability. Over time, as new needs appear and new tools are acquired to address them, gaps in the lineage start to appear.

Option 6: Decode it! (aka decoded lineage)

Ok, so now we know that logic can be ignored, traced with tags and controlled. But all those approaches fall short in most real-life scenarios. Why? Simply because the world is complex, heterogeneous, wild and most importantly — it is constantly evolving. But there is still another way — to read all the logic, to understand it and to reverse engineer it. That literally means to understand every programming language used in your organization for data transformations and movements. And by programming language I mean really everything, including graphic and XML based languages used by ETL tools and reports. And that is the challenging part. It is not easy to develop sufficient support for one language, let alone the tens of them you need in most cases to cover the basics of your environment. Another challenging issue is when the code is dynamic, which means that you build your expressions on the fly based on program inputs, data in tables, environmental variables, etc. But there are ways to handle such situations. On the other hand, this approach is the most accurate and complete as every single piece of logic is processed. It also guarantees the most detailed lineage of all.
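As a toy illustration of reading lineage straight out of the code, the sketch below pulls the target and source tables from a simple INSERT ... SELECT with a regular expression. A real decoded-lineage tool needs a full grammar for every dialect; this hypothetical snippet breaks on subqueries, CTEs, and dynamic SQL, which is exactly why the approach is hard:

```python
import re

def decode_lineage(sql: str):
    """Naively extract target and source tables from an INSERT ... SELECT."""
    target = re.search(r"INSERT\s+INTO\s+([\w.]+)", sql, re.I)
    sources = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.I)
    return {"target": target.group(1), "sources": sources}

sql = """
INSERT INTO dwh.customer_dim (id, email)
SELECT c.id, c.email
FROM crm.customers c
JOIN crm.consents n ON n.customer_id = c.id
"""

print(decode_lineage(sql))
# {'target': 'dwh.customer_dim', 'sources': ['crm.customers', 'crm.consents']}
```

Unlike the similarity or tagging approaches, this works without touching the data and without waiting for the logic to execute — the lineage is derived from the code itself.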

And that’s it. This was not meant to be a scientific article, but I wanted to show you the pros and cons of several popular data lineage approaches. Which leads me back to my first GDPR paragraph. I see enterprises investing a lot of money in data governance solutions with insufficient data lineage capabilities, offering tricks like data similarity, data tagging and even self-lineage. But that is just guesswork, nothing more. Guesswork with a lot of issues and manual labor to correct the lineage.

So, I am asking you once again — is guessing good enough for your GDPR project?

This article was also published on Tomas Kratky’s LinkedIn Pulse.

The Year of MANTA and Why We’ve Published Our Pricing Online

December 31, 2017

We’ve seen a massive surge in the world of data lineage over the last year.  More buzz, more leads, more customers for us and (from what I’ve heard) for other metadata players as well. It might come as a bit of a disruption, but we’ve decided to do something which is very common in other industries, but not in ours. We’ve published our pricing online. Why?

The Year We’ve Come Through

2017 is coming to an end, so it is the right time to take a look back. It was a very hot year for metadata and data governance, partially thanks to the new GDPR regulation, but there are more reasons behind it: more and more enterprises have come to the understanding that the only way to build an efficient data-driven company is through proper data governance. In 2016, data itself got most of the attention — how big it is or can potentially be, and how to manage its volume, velocity, and variety.

In 2017, we all started to realize that it is not just about data but also a lot about data algorithms — the way your data is gathered, transferred, merged, processed, and moved around your company. Thanks to GDPR, internal discussions have been initiated about how and where sensitive/protected data elements are used, and suddenly it turns out that we are flooded not just with data but with data algorithms too, and it is impossible to handle it all without automation.

That has drawn even more attention to MANTA and its unique data lineage automation capabilities. Our website basically exploded – our audience doubled and the use of our live demo nearly tripled. We have on-boarded several amazing new customers from all around the world, and we delivered four major releases this year, with plenty of new features in all of them including Public & Service APIs and new technologies (SSIS, IBM Netezza, IBM DB2, and Impala, to name a few). Simply said, 2017 was a fantastic year and more is coming in 2018!

And even though this year was yet another giant step for MANTA, we decided to do one more thing that will shake things up. We’ve done something that’s pretty common in all the other industries except ours.

Yes, we’ve published our pricing online for everyone to see.

And why?

MANTA is taking the lead in transparency and openness

Sometimes there are good reasons for hiding the price of your product or service. And it is common practice in the enterprise software industry. But does that really make sense? Let’s take a look at the usual reasons, then:

1) You might be legally bound to hide the price when dealing with a government or its suppliers. Yes, national security is a serious issue, and there might be some limitations put on companies that deal with it. But that applies only to individual deals and is hardly a reason to hide the price in general.

2) You want to participate in tenders with secret bids. Yes, that also makes sense — especially when you are dealing with clients that focus only on the price. You would not want to lose just because your bid is a few thousand higher, would you? Perhaps not, but this is not our case — MANTA is a very unique software product with clear, easy-to-see value for its users. The price has to be reasonable, but it is rarely the way to win anyone’s business.

3) You want to keep everybody in the dark. Yes, some do want that. But frankly, it’s a rather dishonest strategy. It’s foolish to expect that customers do not know other players on the market and their prices. It’s even more foolish to try to control the market by spreading rumors and making deals in the shadows.

When you are confident in your product and what it stands for, you are also confident in its price. There’s no reason to follow the “industry standard” of not disclosing enterprise IT product prices. So dive into our pricing right here, and if there’s something that needs clarification, just take a look at our pricing glossary right below it.

Thank you for your support this year and see you in 2018!




A Metadata Map Story: How We Got Lost When Looking for a Meeting Room

September 1, 2017

You may think that I have gone crazy after reading the title above or hope that our blog is finally becoming a much funnier place. But no, I am not crazy and this is not a funny story. [LONG READ]

It is, surprisingly, a metadata story. A few months ago, when visiting one of our most important and hottest prospects, we arrived at the building (a super large finance company with a huge office), signed in and passed through security, called our main contact there, shook hands with him, and entered their private office space with thousands of work desks and chairs, plus many restrooms, kitchens, paintings, and also meeting rooms.

The Ghost of Blueberry Past

A very important meeting was ahead of us, with the main business sponsor who had significant influence over the MANTA purchasing process. Our main agenda was to discuss business cases involving metadata and the role of Manta Flow. So we followed our guide and I asked where we were going. “The blueberry meeting room”, he replied. We stopped several times, checking our current position on the map and trying to figure out where to go next. (It is a really super large office space.) After 10 minutes, we finally got very close, at least according to the map. Our meeting room should have been, as we read it on the map, straight and to the left. But it was not! We ran all over the place, looking around every corner, checking the name printed on every meeting room door, but nothing. We were lost.

Fortunately, there was a big group of people working in the area, so we asked those closest to us. Several guys stood up and started to chat with us about where that room could be. Some of them started to search for the room for us. And luckily, there was one smart and knowledgeable woman who actually knew the blueberry meeting room very well and directed us to it. In 20 seconds, we were there with the business sponsor, although we were a few minutes late. Uffff.

That’s a Suggestive Question, Sir!

Our sponsor runs a big group of business and BI analysts who work with data every single day — they do impact and what-if analyses for the initial phase of every data-related project in the organization. They also do plenty of ad-hoc analyses whenever something goes wrong. You know, to answer those tricky management questions like:

“How did it happen that we didn’t approve this great guy for a loan five months ago?”


“Tell me if there is any way a user with limited access can see any reports or run any ad-hoc queries on sensitive and protected data that should be invisible to her?”

And I knew that, like many organizations out there, they had very bad documentation of the environment — non-existent or obsolete (which is even worse) — most of it in Excel sheets that were manually created for compliance reasons and uploaded to a SharePoint portal. And luckily for us, they had recently started a data governance project with only one goal: to implement Informatica Metadata Manager and build a business glossary and an information catalog with a data lineage solution in it. It seemed to be a perfect time for us, with our unique ability to populate IMM with detailed metadata extracted from various types of programming code (Oracle, Teradata, and Microsoft SQL in this particular environment).

Just Be Honest with Yourself: Your Map Is Bad

So I started my pitch about the importance of metadata for every organization, how critical it is to cover the environment end-to-end, and also the serious limitations IMM has regarding programming code, which is widely used there to move and transform data and to implement business logic. But things went wrong. Our business sponsor was very reluctant to believe the story, being pretty OK with what they have now as a metadata portal. (Tell me, how can anyone call SharePoint with several manually created and rarely updated Excel sheets a metadata portal? I don’t understand!) She asked us repeatedly to show her precisely how we can increase their efficiency. And she was not satisfied with my answers based on our results with other clients. I was lost for the second time that day.

And as I desperately tried to convince her, I told her the story about how we got lost and mixed it with our favorite “metadata like a map, programming code like a tricky road” comparison. “It is great that you even have a map”, I told her. “This map helped us to quickly get very close to the room and saved us a lot of time. But even when we were only 40 meters from our target, we spent another 10 minutes, the very same amount of time needed to walk all the way from the front desk to that place, looking for our room. Only because your great map was not good enough for the last complex and chaotic 5% of our trip. And what is even worse, others had to help us, so we wasted not only our time, but also theirs. So this missing piece of the map multiplied the effort and decreased efficiency. And now think about what happens if your metadata map is incomplete from 40% to 50%, which is the portion of logic you have hidden here inside various kinds of programming code invisible to IMM. Do you really want to ignore it? Or do you really want to track it and maintain it manually?”

And that was it! We got her. The rest of our meeting was much nicer and smoother. Later, when we left, I realized once again how important a good story is in our business. And understandability, urgency and relevance for the customer are what make any story a great one.

And what happened next? We haven’t won anything yet, it is still an open lead, but now nobody has doubts about MANTA. They are struggling with IMM a little bit. So we are waiting and trying to assist them as much as possible, even with technologies that are not ours. Because in the end it does not matter if we load our metadata into IMM or any other solution out there. As long as there is any programming code there, we are needed.

This article was originally published on Tomas Kratky’s LinkedIn Pulse.

Return of the Metadata Bubble

July 27, 2017

The bubble around metadata in BI is back — with all its previous sins and even more just around the corner. [LONG READ]

In my view, 2016 and 2017 are definitely the years for metadata management and data lineage specifically. After the first bubble 15 years ago, people were disappointed with metadata. A lot of money was spent on solutions and projects, but expectations were never met (usually because they were not established realistically, as with any other buzzword at its start). Metadata fell into damnation for many years.

But if you look around today, visit a few BI events, and read some blog posts and comments on social networks, you will see metadata everywhere. How is it possible? Simply because metadata has been reborn through the bubble of data governance associated with the big data and analytics hype. Could you imagine any bigger enterprise today without a data governance program running (or at least in its planning phase)? No! Everyone is talking about a business glossary to track their Critical Data Elements, end-to-end data lineage is once again the holy grail (but this time including the Big Data environment), and we get several metadata-related RFPs every few weeks.

Don’t get me wrong, I’m happy about it. I see proper metadata management practice as a critical denominator for the success of any initiative around data. With huge investments flowing into big data today, it is even more important to have proper governance in place. Without it, chaos and lost money — with no additional revenue — would be the only outcome of big (and small) data analytics. My point is that even if everything looks promising on the surface, I feel a lot of enterprises have taken the wrong approach. Why?

A) No Numbers Approach

I have heard so often that you can’t demonstrate with numbers how metadata helps an organisation. I couldn’t disagree more. Always start to measure efficiency before you start a data governance/metadata project. How many days does it take, on average, to do an impact analysis? How long does it take, on average, to do an ad-hoc analysis? How long does it take to get a new person on board — data analyst, data scientist, developer, architect, etc.? How much time do your senior people spend analysing incidents and errors from testing or production and correcting them? My advice is to focus on one or two important teams and gather data for at least several weeks, or better yet, months. If you aren’t doing it already, you should start immediately.

You should also collect as many “crisis” stories as you can. Such as when a junior employee at a bank mistyped an amount in a source system and a bad $1 000 000 transaction went through. They spent another three weeks in a group of 3 tracking it from its source to all its targets and making corrections. Or when a finance company refused to give a customer a big loan and he came to complain five months later. What a surprise when they ran simulations and found out that they were ready to approve his application. They spent another 5 weeks in a group of 2 trying to figure out what exactly happened to finally discover that a risk algorithm in use had been changed several times over the last few months. When you factor in bad publicity related to this incident, your story is more than solid.

Why all this? Because using your numbers to build a business case and comparing them with numbers after a project to demonstrate efficiency improvements and those well-known, terrifying stories that cause so many troubles to your organisation, will be your “never want it to happen again” memento.

B) Big Bang Approach

I saw several companies last year that started too broad and expected too much in a very short time. When it comes to metadata and data governance, your vision must be complex and broad, but your execution should be “sliced” — the best approach is simply to move step by step. Data governance usually needs some time to demonstrate its value in reduced chaos and better understanding between people in a company. It is tempting to spend a budget quickly, to implement as much functionality as possible, and hope for great success. In most cases, however, it becomes a huge failure. Many good resources are available online on this topic, so I recommend investing your time to read and learn from others’ mistakes first.

I believe that starting with the several critical data elements used most often is the best strategy. Define their business meaning first, then map your business terms to the real world and use an automated approach to track your data elements at both a business and technical level. When the first small set of your data elements is mapped, do your best to show its value to others (see the previous section about how to measure efficiency improvements). With that success, your experience with other data sets will be much smoother and easier.

C) Monolithic Approach

You collect all your metadata and data governance related requirements from both business and technical teams, include your management and other key stakeholders, prepare a wonderful RFP, and share it with all the vendors from the top right Gartner Data Governance quadrant (or Forrester Wave if you like it more). You meet well-dressed salespeople and pre-sales consultants, see amazing demonstrations and marketing papers, hear a lot of promises about how all your requirements will be met, pick a solution you like, implement it, and earn your credit. Prrrrr! Wake up! Marketing papers lie most of the time (see my other post on this subject).

Your environment is probably very complex, with hundreds of different and sometimes very old technologies. Metadata and data governance is primarily an integration initiative. To succeed, business and IT have to be brought together — people, systems, processes, technologies. You can see how hard it is, and you may already know it! To be blunt, there is no single product or vendor covering all your needs. Great tools are out there for business users with compliance perspectives, such as Collibra or Data3Sixty; more big-data-friendly information catalogs, such as Alation, Cloudera Navigator, or Waterline Data; and technical metadata managers, such as IBM Governance Catalog, Informatica Metadata Manager, Adaptive, or ASG. Each one of them, of course, overlaps with the others. Smaller vendors then also focus on specific areas not covered well by the other players — such as MANTA, with its unique ability to turn your programming code into both technical and business data lineage and integrate it with other solutions.

Metadata is not an easy beast to tame. Don’t make it worse by falling into the “one-size-fits-all” trap.

D) Manual Approach

I meet a lot of large companies ignoring automation when it comes to metadata and data governance, especially with big data. Almost everyone builds a metadata portal today, but in most cases it is only a very nice information catalog (the same sort you can buy from Collibra, Data3Sixty, or IBM) without proper support for automated metadata harvesting. The “how to get metadata in” problem is solved in a different way — simply by setting up a manual procedure: whoever wants to load a piece of logic into the DWH or data lake has to provide the associated metadata describing its meaning, structures, logic, data lineage, etc. Do you see how tricky this is? On the surface, you will have a lot of metadata collected, but every bit of information is not reality — it is a perception of reality, and only as good as the information input by a person. What is worse, it will cost you a lot of money to keep it synchronised with the real logic during all updates, upgrades, etc. The history of engineering tells us one fact clearly: any documentation created and maintained manually, especially documentation that is not an integral part of your code/logic, is out of date the very moment it is created.

Sometimes there is a different reason for harvesting metadata manually – typically when you choose a promising DG solution, but it turns out that a lot is missing. Such as when your solution of choice cannot extract metadata from programming code and you end up with an expensive tool without the important pieces of your business and transformation logic inside. Your only chance is to analyse everything remaining by hand, and that means a lot of expense and a slow and error-prone process.

Most of the time I see a combination of a), c), and d), and in rare cases also b). Why is that? I do not know. I have plenty of opinions, but none of them have been substantiated. One thing is for sure: we are doing our best to kill metadata, yet again. This is something I am not ready to accept. Metadata is about understanding, about context, about meaning. Companies like Google and Apple have known it for a long time, which is why they win. The rest of the world is still behind, with compliance and regulations being the most important factors in why large companies implement data governance programs.

I am asking every single professional out there to fight for metadata — to explain that measuring is necessary and easy to implement, that small steps are much safer and easier to manage than a big bang, that an ecosystem of integrated tools provides greater coverage of requirements than a huge monolith, and that automation is possible.

Tomas Kratky is the CEO of MANTA, and this article was originally published on his LinkedIn Pulse. Let him know what you think.

One Small Step for MANTA, One Big Leap for Mankind

June 30, 2017

Tomas Kratky explores his vision behind MANTA’s new capability to visualize business & logical lineage.

We just recently published a blog post announcing one new feature – MANTA now works not only with physical lineage but with business and logical lineage as well. I was shocked by the intensity of the feedback we got from our customers and partners – they were confused. MANTA has a clear vision to provide users with the most detailed, accurate, and fully automated data lineage from programming code. We do it because all data-driven organizations need it, because others are afraid to do it, and because we are smart.

New Levels of Lineage

But now we have announced business lineage and everyone has been asking what that means. Is MANTA moving towards being a more general metadata or data governance solution? NOT AT ALL! So why the business and logical lineage? Let me explain a little bit more.

MANTA offers capabilities not covered by other players, capabilities very much needed in any data-intensive environment. But MANTA is not a metadata manager or an information catalog. There are other, better-equipped vendors for that, like IBM, Informatica, Collibra, Alation, Adaptive, etc. This means that, with some exceptions, MANTA alone does not meet all the metadata-related requirements of a customer. But other metadata solutions, when selected, purchased, and deployed by a customer, also fail to meet several critical needs related to metadata accuracy and completeness, especially regarding data processing logic hidden inside programming code. This leads to an inevitable conclusion — MANTA is usually served together with other tool(s).

MANTA: Born To Integrate

Simply said, we live and die with great integrations. We have many prospects out there, since almost everyone will need us sooner or later, but to fully demonstrate our value, we need smooth integration with existing data governance / metadata solutions. We originally started with more technical oriented tools like Informatica Metadata Manager, so physical lineage was the best option. But now more and more customers have Collibra, IBM Information Governance Catalog, Alation, Data3Sixty, or Axon, and they want to see lineage there. But those solutions are not designed to capture and visualize large amounts of data processing metadata. They tend to slow down or even crash with the millions of processing steps you have in your environment.

Automate or Drown

Some vendors in this space don't even offer automated harvesting capabilities; others do, but only in a limited way. So I very often see customers trying to build simple business-level lineage manually. And this is where our unique features come into play. MANTA still harvests physical, technical metadata from your programming code but is now also able to use existing business or logical mappings to prepare a different perspective – simplified, with easier-to-understand names and descriptions, but still accurate, complete, and fully automated. This allows us to easily integrate with all the not-so-technical solutions mentioned above. It means less wasted effort and fewer stressful moments for our customers, and more prospects for MANTA. I see it as a win-win situation.
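To make the idea concrete, here is a tiny, hypothetical sketch (not MANTA's actual implementation) of how a detailed physical lineage can be projected onto business terms. All table, column, and term names below are invented for illustration; the point is only that the business view is a much smaller graph derived from the physical one plus a mapping.

```python
# Physical lineage: (source, target) edges between physical columns.
physical_edges = [
    ("crm.customer.email",     "stg.customer.email"),
    ("stg.customer.email",     "dwh.dim_customer.email"),
    ("dwh.dim_customer.email", "rpt.churn_report.contact"),
]

# Business mapping: physical object -> business term (the "logical" layer).
# Note that staging and warehouse map to the same term, so their internal
# hop disappears from the business view.
business_map = {
    "crm.customer.email":       "Customer Email (CRM)",
    "stg.customer.email":       "Customer Email (Warehouse)",
    "dwh.dim_customer.email":   "Customer Email (Warehouse)",
    "rpt.churn_report.contact": "Churn Report",
}

def to_business_lineage(edges, mapping):
    """Project physical edges onto business terms, dropping self-loops
    and duplicates so the result is a smaller, readable graph."""
    seen, result = set(), []
    for src, dst in edges:
        b_src = mapping.get(src, src)
        b_dst = mapping.get(dst, dst)
        if b_src != b_dst and (b_src, b_dst) not in seen:
            seen.add((b_src, b_dst))
            result.append((b_src, b_dst))
    return result

for src, dst in to_business_lineage(physical_edges, business_map):
    print(f"{src} -> {dst}")
```

Three physical hops collapse into two business-level edges – simplified, but still derived automatically from the harvested physical metadata rather than drawn by hand.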

This article was originally published on Tomas Kratky’s LinkedIn Pulse.

The Dark Side of the Metadata & Data Lineage World

June 10, 2017 by

You wouldn’t believe it, but there is a dark side to the metadata & data lineage world as well. Tomas Kratky digs deep and explains how you can get into trouble. 


It has been a wonderful spring this year, hasn’t it? The first months of 2017 were hot for us. Data governance, metadata, and data lineage are everywhere. Everyone is talking about them, everyone is looking for a solution. It’s an amazing time. But there is also the other side, the dark side.

The Reality of Metadata Solutions

As we meet more and more large companies, industry experts and analysts, investors, and data professionals, we see a huge gap between their perception of reality and reality itself. What am I talking about? About the end-to-end data lineage ghost. With data being used to make decisions every single day, with regulators like FINRA, the Fed, the SEC, the FCC, and the ECB requesting reports, and with initiatives like BCBS 239 or GDPR (the new European data protection regulation), proper governance and a detailed understanding of the data environment are a must for every enterprise. And E2E (end-to-end) data lineage has become a great symbol of this need. Every metadata/data governance player on the market is talking about it, and their marketing is full of wonderful promises (in the end, that is the main purpose of every marketing leaflet, isn't it?). But what's the reality?

The Automated Baby Beast

The truth is that E2E data lineage is a very tough beast to tame. Just imagine how many systems and data sources you have in your organization, how much data processing logic, how many ETL jobs, how many stored procedures, how many lines of programming code, how many reports, how many ad-hoc Excel sheets, etc. It is huge. Overwhelming!

If your goal is to track every single piece of data and to record every single processing step, every “hop” of data flow through your organization, you have a lot of work to do. And even if you split your big task into smaller ones and start with selected data sets (so-called “critical data elements”) one by one, it can still be so exhausting that you will never finish or even really start. And now data governance players have come in with gorgeous promises packaged in one single word – AUTOMATION.

The promise itself is quite simple to explain – their solutions will analyze all data sources and systems, every single piece of logic, extract metadata from them (so-called metadata harvesting), link it up (so-called metadata stitching), store it, and make it accessible to analysts, architects, and other users through best-in-class, award-winning user interfaces. And all of this through automation. No manual work necessary, or just a little bit. It is so tempting that you are open to it; you want to believe. And so you buy the tool. And then the fun part starts.

The Machine Built to Fail

Almost nothing works as expected. But somehow you progress with the help of hired (and usually overpriced) experienced consultants. Databases (tables, columns) are there, your nice graphically created ETL jobs are there, your first simple reports also, but hey! There is something missing! Why? Simply because you used a nasty complex SQL statement in your beautiful Cognos report. And you used another one when you were not satisfied with the performance of one Informatica PowerCenter job. And hey! Here the lineage is completely broken! Why is THAT? Hmmm, it seems that you decided to write some logic inside stored procedures and not to draw a terrifying ETL workflow, simply because it was so much easier with all those advanced Oracle features. OK, I believe you've got it. Different kinds of SQL code (and not just SQL but also Java, C, Python, and many others) are everywhere in your BI environment. Usually, there are millions and millions of lines of code. And unfortunately (at least for all metadata vendors), programming code is super tough to analyze and extract the necessary metadata from. But without it, there is no E2E data lineage.
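To give a feel for why code is so tough, here is a deliberately naive sketch of harvesting table-level lineage from a single SQL statement with regular expressions. This toy handles only the happy path – nested subqueries, CTEs, dynamic SQL, and vendor dialects are exactly what breaks it, and exactly why automated extraction needs a full parser rather than pattern matching.

```python
import re

def table_level_lineage(sql):
    """Naively pull the target of an INSERT ... SELECT and its source
    tables out of the SQL text. Illustration only, not production code."""
    target = re.search(r"insert\s+into\s+([\w.]+)", sql, re.I)
    sources = re.findall(r"(?:from|join)\s+([\w.]+)", sql, re.I)
    return {
        "target": target.group(1) if target else None,
        "sources": sorted(set(sources)),
    }

sql = """
INSERT INTO dwh.dim_customer
SELECT c.id, c.email, a.city
FROM stg.customer c
JOIN stg.address a ON a.customer_id = c.id
"""

print(table_level_lineage(sql))
# {'target': 'dwh.dim_customer', 'sources': ['stg.address', 'stg.customer']}
```

Multiply this by millions of lines of procedural code, in a dozen dialects, and you see why most vendors quietly fall back to manual analysis.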

At this moment, marketing leaflets hit the wall of reality. As of today, we have met a lot of enterprises but only very few solutions capable of automated metadata extraction from SQL programming code. So what do most big vendors (or the big system integrators implementing their solutions) usually do in this situation? Simply finish the rest of the work manually. Yes, you heard me! No automation anymore. Just good old manual labor. But you know what – it can be quite expensive. For example, a year ago we helped one of our customers reduce the time needed to "finish" their metadata project from four months to just one week! They had been ready to invest the time of five smart guys, four months each, to manually analyze hundreds and hundreds of BTEQ scripts, extract metadata from them, and store it in the existing metadata tool. In the United States, we typically meet clients with several hundred thousand database scripts and stored procedures. That's sooo many! Who is going to pay for that? The vendor? The system integrator? No, you know the answer. In most cases, the customer is the one who has to pay for it.

Know Your Limits

I have been traveling a lot the last few weeks and have met a lot of people, mostly investors and industry analysts, but also a few data professionals. And I was amazed by how little they know about the real capabilities and limitations of existing solutions. Don't get me wrong, I think those big guys do a great job. You can't imagine how hard it is to provide a really easy-to-use metadata or data governance solution. There are so many different stakeholders, needs, and requirements. I admire those big ones. But that should not mean we close our eyes and pretend that those solutions have no limitations. They have limitations, and fortunately the big guys, or at least some of them, have finally realized that it is much better to provide open APIs and allow third parties like Manta to integrate and fill the existing gaps. I love the way IBM and Collibra have opened their platforms, and I feel that others will soon follow.

How can you protect yourself as a customer? Simply conduct proper testing before you buy. Programming code in BI is ignored so often, maybe because it is very low-level and typically not the main topic of discussion among C-level guys. (But there are exceptions – just recently I met a wonderful CDO of a huge US bank who knows everything about the scripts and code they have inside their BI. It was so enlightening, after quite a long time.) It is also very hard to choose a reasonable subset of the data environment for testing. But you must do it properly if you want to be sure about the software you are going to buy. With proper testing, you will realize much sooner that there are limitations, and you will start looking for a solution to address them up front, not in the middle of your project, already behind schedule and with a C-level guy breathing down your neck.

It is time to admit that marketing leaflets lie in most cases (oh, sorry, they just add a little bit of color to the truth), and you must properly test every piece of software you want to buy. Your life will be much easier and success stories of nicely implemented metadata projects won’t be so scarce.

Originally published on Tomas Kratky’s LinkedIn profile.

MANTA 2016: A New Home, New Challenges and a New Name

December 30, 2016 by

Only a few days are left. 2017 is knocking on our doors harder and harder. We, at MANTA, had a wonderful year, a really amazing one.


A New Home

So many things have happened. Let's start at the end. Just a few weeks ago, we finally found a new office, a wonderful place on I.P. Pavlova Square, very close to the historic center of Prague. It was so hard; it took more than six months to find something really suitable. But this place is cool – under the roof, with a lot of space and light and energy. It will become our new home, and we are very happy about it. It is our Christmas gift. The best place for the best team.

New Challenges

Data governance, metadata, and especially data lineage became very hot in 2016. There are many reasons why that happened, but we are excited to see how companies both in the US and Europe have started their governance programs. Also, vendors have become more active in this space and a lot has changed on the metadata market map. But what has remained the same is the need to harvest metadata in an automated way from many different kinds of systems, both old and new, most of the time implemented in the form of programming code.

And while this is a weak spot for all well-known metadata players, MANTA shines when it comes to code. So, that has led us to more and more partnerships with various technology vendors. We are well known for our integration with IMM, but as of this year we are also partners with two other market leaders – IBM and Teradata. We also support Microsoft SQL Server, and a few more additions to our technology stack are planned for the first months of 2017.

New Business and a New Name

But there are also other things that have made us happy. We have continued on our US journey and signed on several new customers there, some of them really huge like one of the TOP4 US banks and one of the TOP5 mutual funds in the world. It is not a surprise that we have been growing super fast. We tripled our revenue in 2016 and our pipeline has simply exploded.

This leads us to another major change – we have decided to drop the "Tools" from our name and become simply MANTA. Everybody was calling us that anyway, so why resist? We'll tell you more about it in the first week of 2017.

But the best part was getting feedback from our customers. They like what we do and how we do it. And that has been the greatest message for the whole MANTA team. It keeps us moving forward and working hard to exceed their highest expectations.

Thank you for everything in 2016 and see you in 2017!

Metadata as Explained to My Grandma

September 25, 2016 by

Our CEO Tomas Kratky explains to his grandma what we do: “Metadata management is simply cartography.” 


A few weeks ago, I talked to my grandma. I travel a lot and she lives quite far away, so we hadn’t seen each other for a long time. She is an old lady, very kind and also very curious about what I do and how I live. She had many questions (half of them about kids which I don’t have yet) but everything went well till she asked about Manta Tools, the company I run. “Tomas”, she said, “tell me more about what you do for a living. You do computers, don’t you?”

I am used to talking about my vision, about what we do as a company and why, with so many people – our team, our clients, our partners, our investors. But none of them are like my grandma. The most advanced thing she can imagine is how to turn on her television. She doesn’t care about big data or any other similar buzzwords, so I was absolutely unable to say even a single word. When I left her room, I knew that I had failed badly.

I spent several weeks thinking about the best way to describe the world I live in, and this is the result. I haven’t used it yet; I wanted to share it with you, other professionals. I hope you’ll find it useful and helpful. And I would be glad to hear your feedback to further improve my story. In the end, I want to make my grandma proud of me.

You Need Information…

There are many huge companies like banks, telcos and utilities around us. Let’s imagine that one huge company is a country, like the United States, for example. Such a company provides a lot of services to its customers and runs a lot of processes. To do that efficiently, they need a lot of IT systems like online banking and accounting. And there are also many other systems which are owned by other companies but are somehow visible and even accessible to our example company, like weather forecasts and social networks. All those systems are valuable sources of information – about clients, about partners, about the environment, about our own performance, etc. Information is our gold today and it always has been. You are stronger when empowered with the right information delivered at the right time. So if information is gold, those systems I’ve just mentioned are gold mines.

…To Make Decisions

The main purpose of information is to use it to make decisions. After all, we live in a data-driven world. Let’s pretend that decision makers are dukes, lords and kings living in that country. They own the gold, for sure, but to use it, we must dig it up and transport it to them. And you know what? This is exactly what business intelligence is about, isn’t it? The mines are data analytics in the real world; the trucks and roads are ETLs (or data processing).

So, we have a great country with gold everywhere and lords eager to get it. And we have a way to find, dig up and deliver the gold to them. But, just as in the real world, our country is large with deep valleys, fierce rivers, steep mountains, dark woods, hot deserts, and insidious swamps. Delivering the gold from the mines to the lords is sometimes a very dangerous job, as you can imagine. We can mistake rocks for gold, or we can damage it (this is why everybody runs data quality programs). But it is also dangerous for us – we can easily get lost if we choose to follow the wrong road. And so, we need maps.


They allow us to share with others the knowledge of one brave Indiana Jones, the very first discoverer of that road, about the right direction and all associated dangers. Those maps are metadata, ladies and gentlemen, and metadata management as a discipline is simply cartography.

Lords always want more gold. They are born with this obsession. So they continuously invest into better and larger mines, wider and safer roads, and faster and bigger wagons to deliver as much gold as possible in the safest way possible. We have great highways now (yes, I mean those wonderful graphical ETL tools like Informatica PowerCenter, IBM DataStage, and Talend) which are very easy to discover and map for any cartographer, regardless of their level of experience.

But with all those rivers, mountains, swamps, and deserts it is simply not smart or even possible to build highways everywhere. Sometimes we just need a way to quickly walk through a dark forest, and building an expensive highway is not a good choice if we don’t understand what’s on the other side (yes, I am talking about data science!). Sometimes we need to build a bridge across a river or a road through a dangerous swamp. So you need special construction to do that (now I am talking about programming code, like SQL for example, which allows us to overcome almost any engineering difficulty). Unfortunately, mapping complex bridges, desert roads, and special roads through swamps is a very tricky job for any cartographer.

The Exponential Growth

What is even worse, in the last few years the number of mines, roads, and wagons of all kinds has grown exponentially. We build them everywhere and usually there is not enough time to do it according to a bigger plan. It is a huge mess today. Detailed, accurate maps are needed more than ever before. We can get really lost without them. All our gold can be lost forever, in just a second.

Fortunately, there are several great and popular companies doing cartography – Informatica, IBM, Adaptive, Collibra, and Alation, to name a few. But they focus more on how their maps look, how easily they can be used, and how understandable they are for drivers. Unfortunately, they don’t map more dangerous roads (programming code in the form of various SQL scripts or stored procedures) correctly. So the maps they produce are rather incomplete with a lot of “here be dragons” areas, blank spots. And this is exactly where Manta Tools comes in.

We are the Indiana Joneses, the adventurers of modern cartography. We discover dangerous roads where others are afraid to go. We map them and help the bigger cartographers produce as complete and precise maps as possible.

With Manta Tools, there are no “here be dragons” areas anymore.

And all the precious and shiny gold of our lords is much safer with us!

How the API Economy Changes the B2B World

July 31, 2016 by

Have you heard about the API economy? I’ll bet your answer is yes. It is an even buzzier buzzword than #BigData. (OK, not really, but it’s almost there!)


Almost every company is now building its public API, trying to open up a little bit to allow its users to access existing services in a programmatic way, not just through a GUI. Hundreds of mashups (combinations of services) pop up every day. Thousands of new applications (mainly mobile) exist only thanks to this phenomenon. But is this really something so new?

Not at all. We’ve had all these things for so many years. B2B has been about programmatic integration from the very beginning. The core principles (technology, algorithms) are still the same, but the market size/opportunity is what differs now. Just think about mobile phones – this great and simultaneously sometimes kinda lame (for all those unhappy users of non-Apple phones) piece of equipment is so widespread that, together with widely accessible Internet, it defines a completely new market.

Now it finally makes sense to create public application services because there is such a big chance someone will use it (and maybe even pay for it – that’s why we call it “an economy”). And with a growing number of consumers, the group of producers is also growing. In the end, it really leads to better user experience and satisfaction in many cases. As with all other buzzwords, there is something to it.

What’s up with the API Economy?

I (and everybody else) like the idea of openness and smooth integration. And with a growing number of application services published by highly regulated and conservative companies like banks, I am becoming more and more disappointed with the world of traditional data management tools. As you know, our product Manta Flow is a great solution when you need to understand what is inside your SQL-like code. We are highly specialized in custom code in BI environments, and as such we mostly find ourselves in situations where our customer is interested in extracting valuable metadata from their BI programs and applications and loading it into their existing metadata manager. Why?

Simply because most tools ignore those complex parts of the BI environment like stored procedures, scripts and other kinds of code. This means that the data lineage is not complete, it’s not easy to do impact analyses, and you waste a lot of money on a tool which gives you so little. But to be able to integrate, we need to find a way to load metadata extracted by Manta Flow into an existing tool and combine it with the metadata already stored inside. And now, ladies and gentlemen, the real fun begins.

It’s Not Easy to Be API-Ready

Usually every vendor has some kind of API to support integration. A nice group of well-designed services exists, with amazing documentation. So you try it. And you fail. So you try it a different way. And you break something. And because you have to come up with a solution to satisfy a customer, you continue – and it hurts. Believe me, we go through this painful process every damn day. How is this possible? Simply because those big monolithic tools were not designed to support integration with other software applications. I understand! Designing and developing a big and truly pluggable software system is not an easy task, even for experienced engineers. But so many things are broken! Some tools try to analyze SQL code, but with bad results. Surprisingly, there is no easy, standardized way to replace the wrong metadata without breaking links to other parts of the repository.


Some tools internally support many different kinds of metadata structures but expose only a very limited part of their capabilities through their API. That means you are forced to work directly with the existing structures in the repository! Some tools have serious performance issues when loading external metadata through their API. There is also one data management product infamously known for its strict legal protection against any integration. Trying to integrate with that tool is like trying to open the doors of a dark dungeon. Such a project can be so painful that the customer decides to stop the integration entirely. Does that sound like the 21st century to you? Not to me.

New Hope

But every cloud has a silver lining, and several traditional vendors have announced massive redesigns. Also, younger and more focused players, like business glossaries, are better designed for integration from the very beginning. (Some great examples are Collibra and Diaku Axon, but there are many more.) We have implemented several integration scenarios requested by our clients with some of these young stars, and you know what – everything went surprisingly well. It seems that finally, after so many years, the API approach is also entering our beloved data management world.

And that is great not just for other-than-huge vendors like us, but especially for customers. Metadata is a very complex beast to tame and no product can become the ultimate solution for it. Great API support allows you to combine several great tools to get a more complete solution. And that’s what matters the most.

Long live the API economy!

Any questions or comments? Just let Tomas know at or via the contact form on the right. 

Agile BI Development in 2016: Where Are We?

Agile development was meant to be the cure for everything. It’s 2016 and Tomas Kratky asks the question: where are we?


BI departments everywhere are under pressure to deliver high quality results and deliver them fast. At the same time, the typical BI environment is becoming more and more complex. Today we use many new technologies, not just standard relational databases with SQL interfaces, but for example NoSQL databases, Hadoop, and also languages like Python or Java for data manipulation.

Another issue we have is a false perception of the work that needs to be done when a business user requests some data. Most business users think that preparing the data is only a tiny part of the work and that the majority of the work is about analyzing the data and later communicating the results. Actually, it’s more like this:


See? The reality is completely different. The communication and analysis of data is that tiny part at the top and the majority of the work is about data preparation. Being a BI guy is simply a tough job these days.

This whole situation has led to an ugly result – businesses are not happy with their data warehouses. We all have probably heard a lot of complaints about DWHs being costly, slow, rigid, or inflexible. But the reality is that DWHs are large critical systems, and there are many, many different stakeholders and requirements which change from day to day. In another similar field, application software development, we had the same issues with delivery, and in those cases, agile processes were good solutions. So our goal is to be inspired and learn how agile can be used in BI.

The Answer: Agile?

One very important note – agility is a really broad term, and today I am only going to speak about agile software development, which means two things from the perspective of a BI development team:

1. How to deliver new features and meet new requirements much faster

2. How to quickly change the direction of development

Could the right answer be agile development? It might be. Everything written in the Agile Manifesto makes sense, but what's missing is implementation guidelines. And so the Manifesto was, a little bit later, enriched with the so-called agile principles. As agile became very popular, we started to believe that agile was a cure for everything. This is a survey from 2009 which clearly demonstrates how popular agile was:


Source: Forrester/Dr. Dobb’s Global Developer Technographic, 2009

And it also shows a few of the many existing agile methodologies. According to some surveys from 2015, agile is currently being used by more than 80% or even 90% of development teams.

Semantic Gap

Later on, we realized that agile is not an ultimate cure. Tom Gilb, in his famous article "Value-Driven Development Principles and Values", written in 2010, went a bit deeper. After conducting a thorough study of the failures, mistakes, and also successes since the very beginning of the software industry, one thing became clear – there is something called a semantic gap between business users and engineers, and this gap causes a lot of trouble. Tom Gilb hit the nail on the head by saying one important thing: "Rapidly iterating in wrong directions is not progress." Therefore, requirements need to be treated very carefully as well.

But even with the semantic gap issue, agile can still be very useful. Over the last ten years, the agile community has come up with several agile practices. They are simple-to-explain things that anyone can start doing to improve his or her software processes. And this is something you should definitely pay attention to. Here you can see agile practices sorted by popularity:


If you have ever heard about agile, these are probably no surprise to you. The typical mistake made by many early adopters of agile was simply being too rigid; I would call it "fanatical". It was everything or nothing. But things do not work that way.

It’s Your Fault If You Fail

Each and every practice should be considered a recommendation, not a rule. It is your responsibility to decide whether it works for you or not. Each company and each team are different, and if the system metaphor practice has no value for your team, just ignore it like we do. Are you unable to get constant feedback from business users? OK, then. Just do your best to get as much feedback as you can.

On the other hand, we’ve been doing agile for a long time, and we’ve learned that some practices (marked in red) are more important than others and significantly influence our ability to be really fast and flexible.


There are basically two groups of practices. The first group is about responsibility. A product owner is someone on your side who is able to make decisions about requirements and user needs, prioritize them, evaluate them, and verify them. It can be someone from the business side, but this job is very time-consuming, so more often the product owner will be the person on your BI team who knows the most about the business. Without such a person, your ability to make quick decisions will be very limited. Making a burndown list is a very simple practice which forces you to clearly define priorities and to select the features and tasks with the highest priority for the next release. And because your releases tend to be more frequent with agile, you can always pick only a very limited number of tasks, making clear priorities vital.

The second group of critical practices is about automation. If your iterations are short, if you integrate the work of all team members on a daily basis and also want to test it to detect errors and correct them as early as possible, and if you need to deliver often, you will find yourself and your team in a big hurry without enough time to handle everything manually. So automation is your best friend. Your goal is to analyze everything you do and replace all manual, time-consuming activities with automated alternatives.

What Tools To Use?

Typical tools you can use include:

1. Modern Version Control Systems

A typical use case involves Git, SVN, or Team Foundation Server storing all the pieces of your code, tracking versions and changes, merging different branches of code, etc. What you are not allowed to do is use a shared file system for that. Unfortunately, it is still quite a common practice among BI folks. Also, be careful about using BI tools which do not support easy, standard versioning. Do not forget that even if you draw pictures, models, or workflows and do not write any SQL, you are still coding.

So a good BI tool stores every piece of information in text-based files – for example, XML. That means you can make them part of a code base managed by Git, for example. A bad BI tool stores everything in binary, proprietary files, which can't be managed effectively by any versioning system. Some tools support a kind of internal versioning, but those are still a big pain for you as a developer, and they lead to fragmented version control.

2. Continuous Integration Tools

You’ll also need tools like Maven and Jenkins or PowerShell and TeamCity to do rapid and automated build and deploy of your BI packages.

3. Tools for Automated Code Analysis and Testing

I recommend using frameworks like DbFit at least to write automated functional tests, and also using a tool for static code analysis to enforce your company standards, best practices, and code conventions (Manta Checker is really good at that). And do not forget – you can't refactor your code very often without proper testing automation.
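To show what such a functional test can look like even without any framework, here is a minimal sketch in the spirit of DbFit: known inputs in, transformation run, outputs asserted. The table names are invented, and SQLite is used only so the example is self-contained; a real test would run the same pattern against your actual DWH platform.

```python
import sqlite3

def run_transformation(conn):
    # The "ETL" under test: keep only the latest record per customer id.
    # Note: selecting a bare column (email) alongside MAX() relies on
    # SQLite's documented behavior of taking it from the row that holds
    # the maximum.
    conn.executescript("""
        CREATE TABLE dim_customer AS
        SELECT id, email, MAX(updated_at) AS updated_at
        FROM stg_customer
        GROUP BY id;
    """)

def test_latest_record_wins():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE stg_customer (id INTEGER, email TEXT, updated_at TEXT);
        INSERT INTO stg_customer VALUES
            (1, 'old@example.com', '2016-01-01'),
            (1, 'new@example.com', '2016-06-01'),
            (2, 'b@example.com',   '2016-03-01');
    """)
    run_transformation(conn)
    rows = conn.execute(
        "SELECT id, email FROM dim_customer ORDER BY id").fetchall()
    # The duplicate for id 1 must resolve to the newer email.
    assert rows == [(1, 'new@example.com'), (2, 'b@example.com')], rows
    print("test passed")

test_latest_record_wins()
```

Once tests like this run in your continuous integration pipeline, refactoring the transformation logic stops being a gamble.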

4. Smart Documentation Tools

In the end, you can’t work on the parts of your system you do not understand. The best combination of tools you can get is something like Wiki to capture basic design ideas and a smart documentation tool able to generate detailed documentation when needed in an automated way. Today there are many very good IDEs that are able to generate mainly control-flow and dependency diagrams. But we are BI guys, and there is one thing that is extremely useful for us – it is called data lineage, or you can call it data flow.

Simply put, it’s a diagram showing you how data flows and is transformed in your DWH. You need data lineage to perform impact analyses and what-if analyses as well as to refactor your code and existing data structures. There are almost no solutions on the market which are able to show you data lineage from your custom code (except our Manta Flow, of course).
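To show why lineage makes those analyses almost trivial, here is a small sketch: once lineage exists as a graph of source-to-target edges, "what is affected if I change this column?" becomes a plain reachability query. All column names below are made up for illustration.

```python
from collections import defaultdict, deque

# Lineage as (source, target) edges; in practice these would be
# harvested automatically from your ETL jobs and code.
edges = [
    ("stg.customer.email",     "dwh.dim_customer.email"),
    ("dwh.dim_customer.email", "rpt.churn.contact"),
    ("dwh.dim_customer.email", "export.crm_feed.email"),
    ("stg.orders.total",       "dwh.fact_orders.total"),
]

def impact(edges, changed):
    """Return everything reachable downstream from `changed` (BFS)."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    affected, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in affected:
                affected.add(nxt)
                queue.append(nxt)
    return affected

print(sorted(impact(edges, "stg.customer.email")))
# -> ['dwh.dim_customer.email', 'export.crm_feed.email', 'rpt.churn.contact']
```

The hard part is never the graph traversal – it is getting complete, accurate edges out of all that custom code in the first place.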

And that's it. Of course, there are other, more advanced practices to support your agility, but this basic stuff is, I believe, something that can be implemented quickly from the perspective of both processes and tools. I definitely suggest starting with a smaller, more experienced team, implementing the most important practices, playing around a little bit, and measuring the results of different approaches. I guarantee that you and your team will experience significant improvements in speed and flexibility very soon.

Do you have any questions or comments? Send them directly to Tomas Kratky at! Also, do not forget to follow us on Twitter and LinkedIn
