Metadata Repository Benchmark: Graph Database Titan


Engineer's Notes
September 13, 2014

We’ve already talked about choosing the right database for a metadata repository in a data warehouse. Because that topic was well received by our audience, we are happy to share some more. In this blog post, we present benchmark results showing how fast the graph database Titan can retrieve information from three storage backends: BerkeleyDB, PersistIt, and Apache Cassandra.

In our test, we focused mostly on centralized storage (BerkeleyDB, PersistIt), since distributed storage is not very common in metadata repositories. One of our requirements was also an “embedded mode” option, so the database would work as “hidden” storage and there would be no need to install, configure, and manage it separately. Because an embedded mode is also available in Apache Cassandra, we included it in our benchmark. Our main focus was on examining PersistIt thoroughly, because it has favorable licensing terms. That is why PersistIt was tested in three different variants, while BerkeleyDB and Cassandra were tested only with default settings.
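For reference, switching Titan between these backends is mostly a matter of its storage configuration. The sketch below shows what such a properties file might look like; the `storage.backend` values follow Titan's documented backend names, but the directory path is a placeholder, not our actual setup:

```properties
# Pick one backend: berkeleyje (BerkeleyDB JE), persistit,
# or embeddedcassandra (Cassandra running in-process).
storage.backend=persistit
# On-disk location for the embedded store (placeholder path).
storage.directory=/var/lib/titan/metadata
```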
Architecture and Settings
The system was divided into two parts: the server and the client, communicating over TCP/IP. The dispatcher module created the connection, communicated with the client, invoked the appropriate server modules, and reported back to the client. The implementation used the Spring Framework. Each specialized server module (merger, exporter, query, etc.) had its own function, and a connector module was used to access the metadata repository. The system architecture is shown in this diagram:

MRB Architecture

Our goal was to determine how many nodes and edges are transitively connected to a single given node and to measure how long that lookup takes with each type of backend storage. We ran the test 89 times in each configuration.
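The measured operation boils down to a transitive reachability count. As a language-neutral illustration (not our actual Titan/Gremlin traversal), here is a minimal breadth-first sketch over a plain adjacency list:

```python
from collections import deque

def transitive_reach(adjacency, start):
    """Count nodes and edges reachable from `start` by following
    outgoing edges transitively (breadth-first traversal)."""
    seen = {start}
    edges = 0
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            edges += 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    # Exclude the start node itself from the node count.
    return len(seen) - 1, edges

# Toy lineage graph: a -> b -> d, a -> c
print(transitive_reach({"a": ["b", "c"], "b": ["d"]}, "a"))  # (3, 3)
```

In the real benchmark this traversal runs inside Titan against the chosen storage backend; the sketch only pins down what "transitively connected" means here.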
The testing was performed on a machine comparable to what one could expect a typical customer to have, with everything at default settings:

  • Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz, 4 cores, 64-bit
  • 8GB RAM, 2x DIMM DDR3 Synchronous 1333 MHz (0.8 ns)
  • ATA SAMSUNG 320GB, ext4
  • Debian 3.2.57-3+deb7u2 x86_64 GNU/Linux

The table below shows the aggregated results of our benchmark for each tested configuration. The first two columns show the configuration and the number of queries. The next two columns show the average and maximum measured response times (both in milliseconds). The fifth and sixth columns show the maximum and average relative delay. The maximum response time measured for each query (1–89) serves as that query's reference time, and the relative delay is the ratio of a configuration's actual response time to that reference time (so 1.00 is the reference value).
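To make the metric concrete, here is a small sketch of the relative-delay computation as described above; the numbers are made up for illustration and are not taken from the benchmark:

```python
def relative_delays(times_by_config):
    """times_by_config maps a configuration name to a list of
    response times (ms), one per query. The reference time for
    query i is the maximum response time measured for that query
    across all configurations; each configuration's relative delay
    is its own time divided by that reference."""
    n = len(next(iter(times_by_config.values())))
    reference = [max(times[i] for times in times_by_config.values())
                 for i in range(n)]
    return {name: [t / r for t, r in zip(times, reference)]
            for name, times in times_by_config.items()}

# Illustrative only: two configurations, two queries.
delays = relative_delays({"persistit": [10.0, 20.0],
                          "berkeleydb": [40.0, 50.0]})
# The slowest configuration per query gets a relative delay of 1.00.
```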
PersistIt was the clear winner in this benchmark. For additional context, we also added a table with the first queries only, i.e., measured while the cache memory was still empty. The “First Query Only” results largely coincide with the aggregated ones.
The results of this benchmark suggest that, performance-wise, PersistIt is the best backend option for a metadata repository built on Titan, even though its support is still considered experimental.
Do you have any questions or comments on this benchmark? Feel free to send us your feedback in the Reddit thread. You can also get in touch with us on Twitter or via email. This benchmark was performed in cooperation with the Faculty of Information Technology of the Czech Technical University.

Jan Andrš

VP of Marketing at MANTA