Engineer’s Notes

How to Handle Impact Analyses in Complex DWHs with Predicates

“How to get full data lineage in complex BI environments and perform reliable impact analyses?” Predicates (with the help of Manta Flow!) might be the answer. 

During our pilots and deployments, we often find data warehouse environments that use very general physical models built around several big tables like PARTY, BALANCE, ORDER, and others. These tables contain data obtained from various source systems, and a lot of data marts and reports are built on top of them. They make impact analysis difficult because data lineage from almost every report passes through them to every source, rendering the result worthless.

Impact Analyses Do Not Have to Be THIS BIG 

Let’s take a look at an example to understand exactly what happens. The table PARTY contains all individuals and companies that are somehow related to the organization. Thus, in one table, it is possible to have records for clients, employees, suppliers, and the organization’s branch network. Each type of entity is identified by a unique attribute or by the source system from which the data is obtained – for example, clients are managed in a different system than employees.

Now, let’s assume we have two reports based on data from the PARTY table – a report EMPL_REPORT that displays information about employees and another report BRANCH_REPORT that displays information about the branch network. If we use the standard data lineage analysis, we can get this picture:

[Figure: standard data lineage – both EMPL_REPORT and BRANCH_REPORT trace back through PARTY to all source tables]

Although only data from the EMPLOYEE source table is relevant for the report EMPL_REPORT, the impact analysis from that report also includes the CLIENT, BRANCH, and SUPPLIER source tables because of the PARTY table. The problem is the same for the report BRANCH_REPORT. In the other direction, the impact analysis from the EMPLOYEE source table includes both EMPL_REPORT and BRANCH_REPORT, which is confusing.
In a real environment, there are dozens of source systems and hundreds of reports, which makes the standard data lineage analysis worthless.

The Advanced Data Lineage Analysis 

Fortunately, there is a solution. When data is inserted into the PARTY table from different source systems, there is often a column like PARTY.source_system_id where the identification of the source system is stored as a constant. Similarly, when a report consumes data from only specific source systems, there is a condition in the statement filtering data based on the PARTY.source_system_id column. Thus, it is possible to automatically analyze both the insertions into and the selections from the PARTY table and create predicates such as PARTY.source_system_id = 20 that are then stored together with the data lineage in the metadata repository, so they can be included in the computation during impact analysis.

Thanks to that, if we perform an impact analysis from the report EMPL_REPORT, the predicate PARTY.source_system_id = 20 is gathered before reaching the table PARTY. When the analysis continues towards the source tables, the predicate for each path is selected and compared to what has already been gathered. Therefore, when the path to the source table CLIENT with the predicate PARTY.source_system_id = 10 is tested, the result is that both predicates cannot hold at once, so data for this report cannot come from this source table. Conversely, when the path to the source table EMPLOYEE with the predicate PARTY.source_system_id = 20 is tested, the result is that data for this report can come from this source table, so it is included in the result of the impact analysis. We get similar results if we perform an impact analysis for BRANCH_REPORT, and also from sources like the EMPLOYEE table.
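
To make the idea concrete, here is a minimal, self-contained sketch of the predicate pruning described above. It is not Manta Flow’s actual implementation – the Predicate record and the method names are purely illustrative – but it shows why the CLIENT path is discarded while the EMPLOYEE path survives:

    // A toy model of equality predicates such as PARTY.source_system_id = 20.
    // Illustrative only – not Manta Flow's internal representation.
    import java.util.List;

    public class PredicatePruningSketch {

        record Predicate(String column, Object value) {}

        // Two equality predicates on the same column contradict each other
        // when they require different constant values.
        static boolean contradicts(Predicate a, Predicate b) {
            return a.column().equals(b.column()) && !a.value().equals(b.value());
        }

        // A path to a source table is viable only if its predicate does not
        // contradict any predicate gathered on the way from the report.
        static boolean pathIsViable(List<Predicate> gathered, Predicate pathPredicate) {
            return gathered.stream().noneMatch(p -> contradicts(p, pathPredicate));
        }

        public static void main(String[] args) {
            // Gathered while tracing EMPL_REPORT backwards, before reaching PARTY:
            List<Predicate> fromReport = List.of(new Predicate("PARTY.source_system_id", 20));

            Predicate clientPath   = new Predicate("PARTY.source_system_id", 10); // insert from CLIENT
            Predicate employeePath = new Predicate("PARTY.source_system_id", 20); // insert from EMPLOYEE

            System.out.println("CLIENT reachable?   " + pathIsViable(fromReport, clientPath));   // false
            System.out.println("EMPLOYEE reachable? " + pathIsViable(fromReport, employeePath)); // true
        }
    }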

The result of the advanced data lineage analysis can look like this (in reality, if we perform the impact analysis from the EMPL_REPORT, we will only see the EMPLOYEE and PARTY tables):

[Figure: the advanced data lineage – each report traces back through PARTY only to its relevant source table]

Of course, the situation can be far more complex. For example, data from the PARTY table can first be pre-computed for several source systems, and then several reports can be created on top of that pre-computed data, each for only a specific source system, like in this picture:

[Figure: pre-computed PARTY data shared by several source systems, with reports filtered to a single source system]

This can also be handled and, as you may have expected, it is part of the Manta Flow product analysis as well.

If you have any questions or comments, feel free to contact Lukas at manta@mantatools.com. You can try these predicate-based impact analyses in our free trial – just request it using the form on the right. 

How To Inspect Raw Data Lineage With Manta Flow

Risk departments have a lot of complex SQL queries in their data warehouses and data marts. But sometimes it’s really difficult to find the right level of detail. Manta Flow can help.

“When we present Manta Flow to potential customers, most of them are happy that we can reduce very complex SQL statements to a few simple rectangles connected by arrows”, explains Lukas Hermann, our Director of Engineering. “They need to be able to quickly understand what source tables their SQL queries read, what target tables they fill, what columns are involved in computing a particular column and how.”

The Usual

For example, let’s look at just two ordinary insert statements moving data from a stage to a datamart and to a report:

[Figure: the raw SQL of the two insert statements]

It could take you quite a while to analyze which columns are involved in the computation. But with Manta Flow it is really easy to see, including all the statements involved:

[Figure: Manta Flow visualization of the two statements]

This is perfectly sufficient for all business analytics in data warehouse environments. All the unnecessary details – like exactly how data is computed, filtered, aggregated, or ordered – are hidden. And if you want to go deeper, Manta Flow can easily show the SQL code of the statements, where you can find the full detail.

The Raw

However, some analysts (particularly from the aforementioned risk departments) say that their SQL statements are really huge, including many subselects, complex expressions, etc., so the jump between the clear picture and the SQL code is too big. Therefore, they would like to see all the computation steps in a similarly simplified format, and they ask whether Manta Flow can handle it – whether it has all the information necessary to show it.

The answer is that Manta Flow has the most detailed information possible about each part of the statement, but to avoid overwhelming you with what are, in most cases, useless details, it filters the information to the most useful level of detail. If you want to see everything, including expressions, conditions, aggregations, etc., it is possible to configure or completely turn off the filtering. Manta Flow can show you unfiltered information while still keeping you oriented within your own systems.

[Figure: the detailed, unfiltered Manta Flow visualization]

See? It is possible to show the SQL code at the precise position of each part of the statement shown.

If you’d like to try something like that yourself, just let us know in the form on the right. Also, do not forget to follow us on Twitter.

Manta Tools 3.12: The New Versions of Oracle & Informatica, a New Demo, and More!

Surprise! Another Manta Tools release here a bit early. And what’s in the box?

Our customer base is growing quickly, and feature requests are piling up. That’s why Lukas (our Codehead Prime) decided to squeeze one more major release into the schedule. So, what’s new?

We need to keep up with the development of the technologies we support – that’s one of the keys to our success. We generally support the latest versions, but the guys did a thorough review and declared Oracle 12c and Informatica 10 (namely Informatica PowerCenter and Informatica Metadata Manager) fully supported. Although most of the market still runs older versions, both Oracle and Informatica are pushing people really hard to switch to the newest releases. Informatica has even announced the end of support for the 9.x versions next year.

Manta Flow, in particular, received a few major updates (minor updates are not worth mentioning due to the very fact that they are minor). A REST API is now available in Manta Flow. Our native visualization can now easily show SQL overrides, and individual nodes (and their children) can now be contracted with just one click. What does it mean to “contract nodes,” you ask? I thought you would ask, so I made an animated GIF about it! Check it out:

[Animation: contracting a node in the Manta Flow visualization]

See? The big fat script in the middle got out of the way, and you can see just the tables you wanted. Notice that the line from one table to the other is thicker, so you can click on it to show the script again.

And last, but not least: the Manta Flow online demo has been updated, and it now corresponds to the latest version of Manta Flow you can get off the shelf. There are some limits to it, though – as with any cloud product, it does not have access to your system and it cannot show full data lineage (only the part you push in). But try it anyway, it’s awesome. (Or ask for a free trial right away, and test Manta Flow at home.)

Did you know we are on Twitter? Yes, it’s true. Follow us, please. Also, do not hesitate to ask any questions and submit any comments via email to manta@mantatools.com. 

Manta Tools 3.11: Increased Compatibility, Improved Visualization & More

Compatible with yet another data source & offering more user-friendly visualizations, that’s Manta Tools 3.11.  

So, what’s new? Manta Flow’s visualization export functions have been enhanced significantly – you can now filter resources and change the level of detail on which the lineage is extracted. Also, native Manta Flow visualization now shows SQL override queries in Informatica PowerCenter as a parameter.

For even faster automated visualization setup, Manta Flow now supports selection via text input (such as CSV files). This is especially useful when a customer with a huge environment would like to visualize a very specific part of it. And, as always, the Teradata & Oracle parsers are now better than ever.

And last, but definitely not least, is our biggest new “feature” – compatibility with Microsoft SQL Server. But this topic is way too big for a humble release blog post like this one. We will announce it in a separate article tomorrow.

Any questions, comments or concerns? Just send them to manta@mantatools.com. And do not forget to follow us on Twitter!

Manta Tools 3.10: Unlimited Time Machine, Automated Backup & More

The new version is out – what is new with Manta Tools 3.10?

One of the all-time most requested Manta Flow features was the unlimited time machine function. It is now possible to see every visualization you’ve ever performed on your system and freely explore them as you would the current ones. Manta Tools’ metadata repository can also be automatically backed up whenever you like, so even your most paranoid teammates can sleep at night.

When it comes to technology, we’ve improved metadata extraction from Teradata databases and the predicate system for Oracle and Informatica. The Manta Flow visualization itself has also been updated. We’ve added extended highlighting (you can try this in our online demo!) and also a direct export option on the customer’s catalog page.

And what’s coming in our next version? We will finally release support for a whole new technology – Microsoft SQL Server, including a connection with Informatica Metadata Manager.

Do you have any questions or suggestions? Please let us know at manta@mantatools.com, and do not forget to follow us on Twitter.

Manta Tools’ New Customer in Germany: DIY Retailer OBI

We have our first customer in Germany – and it’s the most famous beaver in Europe!

Germany is one of the most promising markets for us in the EMEA region (along with the rest of Western Europe), and we pushed strongly there in parallel with our expansion to the United States. We managed to get a few experienced partners in Germany, attended the TDWI conference in Munich, and our joint efforts have paid off.

At the end of the summer, we signed up our first customer in Germany: OBI, the biggest DIY retail store in Europe and the third largest on the entire planet. We were happy to provide them with Manta Flow for Teradata, capable of analyzing all the data flows of their whole Teradata-based environment, and our own end-to-end data lineage visualization (you can try it for yourself in our demo center).

“When I first talked with my team in the beginning of 2015 about Germany being one of the top target markets for us, I believed we would be able to close the first deal there before the end of the year,” says Tomas Kratky, CEO of Manta Tools. “We slowly built a network of hard-working sales reps and enthusiastic partners, started targeted marketing campaigns, and voilà, we closed the first deal by the end of this year’s sunny summer. And our task is simple – to get other customers and to get them fast.”

We are preparing a joint case study, so stay tuned for details!

Do not forget to follow us on Twitter, LinkedIn, or the German LinkedIn-like social network XING.

Manta Tools 3.9: Time Machine, Role Management & More

It’s been a while, but a new release of Manta Tools is here! And what’s new in version 3.9? 

As usual, a lot. Manta Checker is introducing advanced role management – it is now used by bigger BI/DWH teams, and we needed to comply with all their enterprise security policies. Admin/configuration roles are now separate from user/engineer roles, and compliance officers now have broader control over what’s happening in Manta Tools. The rules used to enforce standards and policies in Checker now come in three different categories – pre-validation, validation, and post-validation rules. That’s crucial for most of the complex transformation scenarios – Tomas Fechtner is going to explain this feature in more detail in a separate article.

And last but not least – for the very first time, we are introducing something called incremental parsing. Manta Checker’s performance is improved by parsing only the parts of the code that are necessary, making transformations faster and more accurate.

Manta Flow has also been vastly improved, mostly on the grammar side – we improve the grammars in every release, but this time it was Oracle that received the biggest update. Also, Manta Flow for Oracle received a major update to its database link support. But the biggest and most important addition is the Manta Flow time machine function. The simple ability to work with previous versions of a visualization gives users even more control over what is happening in their data warehouse.

Up Next in 3.10

In the next version, Manta Tools 3.10, we are planning on expanding the powerful time machine function (and making it possible to compare the current version with any previous version in the history of the universe) and with both products, we will focus on Teradata a little more next time.

Comments? Questions? Let us know at manta@mantatools.com or use the form on the right.

Manta Flow + SAP PowerDesigner: A New Metamodel Mapping Platform

In a previous article, Lukas Hermann explained how PowerDesigner and Manta Flow can work together. Now, our Senior Developer Jiri Tousek will explain all the technical details. 

Basically, we’ve created a general procedure for importing PowerDesigner’s metadata to Manta Flow for various purposes, e.g. advanced impact analyses. It is a simple, two-step process:

Step 1: Get Metadata from PowerDesigner

First, we need to export/access the PDM file and extract its metadata as an XML file (our tool also supports batch processing of multiple files). The metadata is easily mapped to physical database objects as long as the physical name (the “code” field) is filled in correctly and the correct model object type is used. PowerDesigner’s own API is responsible for compatibility between different versions of PD; we use the API because the PDM file format has changed a lot between PD versions. Each model is extracted with both standard and extended attributes for all packages, tables, views, and columns.

Step 2: Mapping Platform

Our mapping platform supports metadata in any general XML file. There’s only one requirement: the XML has to contain physical data model names and the object types of the database objects being mapped. Our configuration is quite flexible though. It’s XPath-based and can support pretty much any metadata scheme. It goes without saying that you can choose what metadata to extract from the input and that we also support translation of the metadata attribute names (e.g. from technical identifiers to human-readable attribute labels).
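
As an illustration of the XPath-based approach, here is a rough sketch of how such a mapping configuration could pull physical names out of a generic XML export. The element and attribute names (Table, Column, code) and the file name are assumptions made for the example, not the actual PowerDesigner or Manta Flow formats:

    // A sketch of XPath-driven metadata extraction from a generic XML export.
    // Element/attribute names and the input file are illustrative assumptions only.
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class XmlMetadataMappingSketch {
        public static void main(String[] args) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse("exported_model.xml");          // hypothetical export file

            XPath xpath = XPathFactory.newInstance().newXPath();

            // One configurable XPath expression per mapped object type;
            // here, elements of type "Table" are mapped via their physical name ("code").
            NodeList tables = (NodeList) xpath.evaluate(
                    "//Table[@code]", doc, XPathConstants.NODESET);

            for (int i = 0; i < tables.getLength(); i++) {
                Element table = (Element) tables.item(i);
                String physicalName = table.getAttribute("code");

                // Nested columns are mapped the same way; translation of technical
                // attribute identifiers to human-readable labels is omitted here.
                NodeList columns = (NodeList) xpath.evaluate(
                        "Column/@code", table, XPathConstants.NODESET);

                System.out.println("TABLE " + physicalName
                        + " with " + columns.getLength() + " columns");
            }
        }
    }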

Benefits

So, this is a short summary of how we export metadata from PowerDesigner and import it to Manta Flow. And how can you benefit from this in your daily work? Well, there are a bunch of ways, but this triplet comes to mind right now:

1) In Manta Flow, you can see the data flows together with their attributes.

2) Pretty much everything written in this article.

3) You can easily create an impact analysis regarding those attributes. This is probably the most interesting one, so let’s illustrate it with a use case – security impact analysis:

Imagine that you need to investigate how your sensitive data propagates through your system. And obviously, you need to compare the real security status with the rules, guidelines, and procedures of your organization. That’s not something you can just wave off, and this solution will make that work a lot easier. In order to perform this analysis, you need to define sensitivity levels for (some of) your database objects. By importing those from PowerDesigner, you can avoid manually setting up administration rights for sensitivity levels. Using the policies you already have in place for PD will save you time as you avoid bureaucratic hassles.

Jiri Tousek would love to hear your feedback on this – let him know what you think at manta@mantatools.com and follow us on Twitter.

Benchmark: New Elements Per Commit in TitanDB

In our last blog post, we described how to optimize the process in TitanDB by adding the proper index.

Today, we would like to illustrate the importance of finding the ideal number of new elements per commit in TitanDB, because it can have a significant impact on performance.

First, we will describe the algorithm for which we were looking for the proper commit size. Basically, it is graph merging. The newly inserted sub-graph is compared to the existing big graph in the database. The algorithm tries to find common parts and creates only the vertices and edges that do not yet exist in the database. To summarize, in addition to creating new vertices and edges, it fetches vertices, compares attributes, and traverses the graph.
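
For readers who want to picture the loop, here is a simplified sketch of such a merge with batched commits, written against the Blueprints 2.x API that Titan 0.4 exposes. It is not the actual Manta Tools merge code – the lookup by a "name" property and the vertex-only merge are illustrative simplifications:

    // A simplified merge loop with batched commits (Blueprints 2.x / Titan 0.4 style).
    // Illustrative only: real merging also creates edges and compares more attributes.
    import com.tinkerpop.blueprints.TransactionalGraph;
    import com.tinkerpop.blueprints.Vertex;
    import java.util.List;

    public class BatchedMergeSketch {

        // Batch size to tune per workload – see the measured results below.
        static final int ELEMENTS_PER_COMMIT = 500;

        static void merge(TransactionalGraph graph, List<String> newVertexNames) {
            int createdSinceCommit = 0;
            for (String name : newVertexNames) {
                // Fetch and compare: reuse an existing vertex with the same name if present.
                Vertex existing = null;
                for (Vertex v : graph.getVertices("name", name)) {
                    existing = v;
                    break;
                }
                if (existing == null) {
                    Vertex created = graph.addVertex(null);
                    created.setProperty("name", name);
                    createdSinceCommit++;
                }
                // Commit in batches; one huge transaction and per-element commits both hurt.
                if (createdSinceCommit >= ELEMENTS_PER_COMMIT) {
                    graph.commit();
                    createdSinceCommit = 0;
                }
            }
            graph.commit();   // flush the remainder
        }
    }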

The measurement was performed very simply. We already had the algorithm, so we only needed to vary the number of newly created elements (vertices and edges together) per commit. Before the test, we searched the Internet for the proper commit size for TitanDB and found that it should be around 10k elements per commit.

For the test we used two data samples, the same as in the previous post about indices. The first one was part of a real data warehouse with about 1M vertices and 2M edges. The second one was an artificial sample with 50K vertices and almost no edges, but with the super node problem.

You can see the results in the table:

[Table: total processing time for various commit sizes, for both data samples]

The best result for our algorithm was only 500 new elements per commit, 20 times smaller than the universal advice. And the difference in time spent was about 30%, which is pretty nice. The artificial sample needed an even smaller commit size, and the difference in performance was greater (91%). Why are our results so different? We think the cause is the nature of our algorithm, in which the creation of new objects is only one part of all the database operations.

The main conclusion we can draw from this test is that the size of the commit can make a significant difference. We definitely recommend performing similar tests to find your ideal commit size.

The tests were done on a computer with the following configuration:

OS: Win 7 64-bit
Processor: Intel Core i7-3740QM; 8 CPUs; 2.7 GHz
RAM: 8GB DDR3 (1600MHz)
Disk: SSD

Java 1.7.51
JVM: 2GB RAM

Titan: 0.4.4
Backend Storage: Persistit
Db cache size: 0.6

Any comments on this article? Just let us know in the form on the right. Also, do not forget to subscribe to our blog feed.

Benchmark: Vertex Centric Indices in TitanDB

Our dev team’s second in command, Tomas Fechtner, prepared a short benchmark on searching a handy index in TitanDB. 

An index is an essential component of database design. One of its main purposes is to speed up queries that would be unusable without indices. This premise applies to traditional SQL and NoSQL databases as well as to graph databases such as Titan, which was our choice for metadata storage. In this post, we would like to share with you an issue concerning graph database indices, and our solution.

As we already mentioned in the previous post, we had chosen Titan as the metadata storage database in Manta Tools. Apart from other elements, the data model also includes an ancestor/descendant relationship, e.g., for representing which tables belong to a diagram. In the graph domain, we can visualize it as a pair of vertices connected by an edge. Among other attributes, the vertices have a name attribute, and the edge has a hasParent label, going from descendant to ancestor.

The described part of the data model is used in two ways:

  • we know the descendant object and want to know the ancestor object;
  • we know the ancestor object and descendant name and want to obtain the descendant object.

A descendant has at most one ancestor; thus, in terms of performance, case no. 1 is not of interest. On the other hand, an ancestor can have many descendants, i.e., a standard 1:N relationship. The vertex name is not unique across the entire database. However, to simplify, let’s assume that a vertex can only have a single descendant with a particular name.

First of all, let’s discuss the situation and why an index is required at all. We just need to go through all the descendants of a particular ancestor, comparing their names with the one we are searching for. Such a traversal only involves a single edge hop and, in a development environment, everything can work quickly and seamlessly. The issue arises when, for instance, a vertex has tens of thousands of descendants, which can very well happen. Imagine, for example, a diagram and the number of its tables/views/procedures in a really large warehouse. So, an index would definitely come in handy here. But which one?

The first, somewhat naive approach would be to create an index on the name attribute. When querying, all vertices with the particular name would be fetched and then looked through, filtering for those connected to the particular ancestor by an edge. This approach hits a wall when, in the entire graph (with millions or even tens of millions of vertices), there are several thousand nodes with the same name. The algorithm has to go through all of them, verifying their ancestors. In the execution-plan language of SQL databases, the edge is used as a filter rather than as an access condition.

Because, when querying from the ancestor, the incident edges must be selected first and only then does the algorithm reach the vertices at their other ends, it is advisable to limit the number of edges fetched. For that reason, we have introduced a new edge attribute, childName. As its name suggests, the attribute contains the name of the descendant at the end of the edge. An index can be built on this attribute, which significantly limits the actual disk operations for the edge selection before the information about the connected vertices is fetched.

With this redundant information, which is, of course, reflected in increased memory requirements, we have significantly sped up the process. However, for certain specific cases, the solution still did not suffice. We only arrived at an ideal solution by including the so-called vertex-centric indices. These are indices built for particular vertices; in other words, each vertex has its own index tree, unlike the standard indices, which are shared by the entire graph. Among other things, this also significantly shortens the leaf node chains.
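
For the curious, the following sketch shows how the childName attribute and the vertex-centric index fit together, based on the Titan 0.4 type-definition API (exact signatures may differ between Titan versions, so treat it as an illustration rather than copy-paste code):

    // Sketch of the childName edge property and the vertex-centric index (Titan 0.4 style).
    import com.thinkaurelius.titan.core.TitanGraph;
    import com.thinkaurelius.titan.core.TitanKey;
    import com.tinkerpop.blueprints.Direction;
    import com.tinkerpop.blueprints.Vertex;

    public class VertexCentricIndexSketch {

        static void defineSchema(TitanGraph graph) {
            // The redundant edge property holding the name of the descendant at the edge's tail.
            TitanKey childName = graph.makeKey("childName").dataType(String.class).make();

            // sortKey(childName) creates a vertex-centric index on hasParent edges, so every
            // vertex keeps its incident hasParent edges ordered (and searchable) by childName.
            graph.makeLabel("hasParent").sortKey(childName).make();
        }

        // Case no. 2 above: we know the ancestor and the descendant's name.
        static Vertex findDescendant(Vertex ancestor, String name) {
            // With the vertex-centric index, only the matching hasParent edges are touched;
            // without it, all incident edges of a super node would have to be scanned.
            for (Vertex descendant : ancestor.query()
                    .direction(Direction.IN)
                    .labels("hasParent")
                    .has("childName", name)
                    .vertices()) {
                return descendant;   // at most one descendant per name (see the assumption above)
            }
            return null;
        }
    }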

[Table: query times with and without the indices for both data sets]

Set 1 – 50k nodes with one super node; an artificial sample
Set 2 – 1,193k nodes without a super node; a complete data warehouse of medium size

Based on this experience, we would like to suggest that you think carefully about your queries over 1:N relationships. Are you sure they cannot develop the super node problem? Fortunately, Titan provides a tool to tackle this problem. However, it is still up to developers and analysts to identify the problematic situations and apply the aforementioned tools.

Further reading

A Solution to the Supernode Problem: http://thinkaurelius.com/2012/10/25/a-solution-to-the-supernode-problem/

Configuration

OS: Win 7 64-bit
Processor: Intel Core i7-3740QM; 8 CPUs; 2.7 GHz
RAM: 8GB DDR3 (1600MHz)
Disk: SSD

Java 1.7.51
JVM: 2GB RAM

Titan: 0.4.4
Backend Storage: Persistit
Db cache size: 0.6

Please send your comments directly to the author of the article or post them in its thread on Reddit.
