Our dev team’s second in command, Tomas Fechtner, prepared a short benchmark on searching a handy index in TitanDB.
Index is an essential component of a database design. One of its main purposes is to speed up queries which would be unusable without indices. This premise applies to traditional relational SQL/NoSQL databases, as well as to graph databases, such as, for instance, Titan which was our choice for our metadata storage. In this post, we would like to share with you an issue and our solution, concerning graph database indices.
As we already mentioned in the previous post, we had chosen Titan as the metadata storage database in Manta Tools. Apart from other elements, the data module also includes an ancestor/descendant relationship, e.g., for representing table relevancy in the diagram. In the graph domain, we can visualize it as a couple of vertices connected by an edge. Among others, the vertices will have a name attribute and the edge will has a hasParent label, going from descendant to ancestor.
The described part of the database data module is used for two approaches
- we know the descendant object and want to know the ancestor object;
- we know the ancestor object and descendant name and want to obtain the descendant object.
A descendant has a maximum of one ancestor, thus, in terms of performance, case no. 1 is not of our interest. On the other hand, an ancestor can have many descendants, i.e., the standard relationship of 1:N. The vertex name is not unique across the entire database. However, to simplify, let’s assume that a vertex can only have a single descendant with a particular name.
First of all, we are going to discuss the situation and why the index is required in general. We just need to go through all descendants of a particular ancestor, comparing their names with the one searched. Such traverse only involves a single edge and, in the development environment, everything can function quickly and seamlessly. The issue arises when, for instance, a vertex has tens of thousands of descendants, which can very well happen. Imagine, for example, a diagram and the number of its tables/views/procedures in a really large warehouse. So, an index would definitely come handy here. But which one?
The first, somewhat naive approach would involve selected an index as the name attribute. When querying, all vertices with the particular name would then be obtained, being subsequently looked through, while filtering those that are connected to the particular ancestor by an edge. This approach hits the wall when, in the entire graph (with millions or even tens of millions of vertices), there are several thousand nodes with the same name. The algorithm has to go through all of them, verifying their ancestors. In the execution plan language used for SQL databases, the particular edge is used as a filter, rather than an access condition.
With respect to the fact that, upon querying from the ancestor, the incident edges must be selected first while, then only, the algorithm gets to the vertices at their other ends, it is advisable to limit the fetched edges, For that reason, we have introduced a new edge attribute, the childName. As its name suggests, the attribute contains the name of the descendant at the end of the edge. Using this attribute, the index can be implemented, thus significantly limiting the actual disk operations for the edge selection, while subsequently fetching information about the vertices connected.
With this redundant information, which is, of course, reflected in increasing memory requirements, we have significantly sped up the process. However, for certain specific cases, the solution still did not suffice. We only arrived at an ideal solution by including the so-called vertex-centric indices. These are indices that are implemented for particular vertices; in other words, each vertex has its own index tree, unlike the standard indices which are common for the entire graph. Apart from other aspects, this also significantly shortens the leaf node chain.
Set 1 – 50k nodes with one super node; an artificial sample
Set 2 – 1 193k nodes without super node; complete data warehouse of medium size
Based on this experience, we would like to suggest that you think about all three queries for the 1:N type relationships. Are you sure they cannot develop the super node problem? Fortunately, Titan provides a tool to tackle this problem. However, it is still up to developers and analysts to identify the problem situations and apply the aforementioned tools.
A Solution to the Supernode Problem: http://thinkaurelius.com/2012/10/25/a-solution-to-the-supernode-problem/
OS: Win 7 64-bit
Processor: Intel Core i7-3740QM; 8 CPUs; 2,7 GHz
RAM: 8GB DDR3 (1600MHz)
JVM: 2GB RAM
Backend Storage: Persistit
Db cache size: 0.6