indie.kim

Adventures of an independent software developer.

This is a rather short update about a script that I just had to write: devstrap.

devstrap sets up a complete development environment from scratch. Most tools and environments are installed and configured automagically (Oh My Zsh, MongoDB/RethinkDB, a proper Python, Scala, etc.). For the rest (Docker, Firefox, iTerm2), devstrap at least downloads their installers.

Happy coding!

I recently visited the Bay Area on a CODAMONO business trip and I had some rather interesting meetings in sunny California. Here are the highlights.

Pachyderm is building a modern Hadoop – in their own words. After talking to Joe Doliner (Co-founder/CEO) I can say that their Pachyderm File System, utilization of Docker, and Pachyderm Pipelines sound extremely promising. I also met their developer team briefly and they certainly are dedicated.

RethinkDB is doing well. Not only is their new T-shirt design impressive, but so is their commitment to continuously improving their product. I talked a bit about particular performance bottlenecks with Slava Akhmechet (Co-founder) and it was great to hear that there is a lot of work in progress. More features will be coming too, but I will not spill them here. Sorry.

IBM opened a new technology center in San Francisco: the Spark Technology Center. Frederick Reiss gave a brilliant talk on SystemML and their contributions to Apache Spark. I believe that IBM will not only move mountains with this endeavor, but also a considerable amount of (real-world) Big Data!

Apart from that, I highly recommend visiting Roam Artisan Burgers. Their burgers are delicious!

This year’s BioHackathon took place in Nagasaki, Japan, and it was the first BioHackathon since I became an entrepreneur. So, as you do, I had my business sponsor Google Cloud instances with genomic data for hackathon attendees to play with.

For the playgrounds I used mouse genome data: GFF3 and GVF features/variations of Ensembl release 81 and VCF indels of Sanger’s MGP. All data was converted to JSON/JSON-LD via BioInterchange 2.0.4+105. Altogether, over 90 million genomic features with associated sample data and annotations were made available per instance (except for ArangoDB, see below).

MongoDB: Two n1-standard-2 instances (2 vCPUs, 7.5 GB memory) running MongoDB 3.0 with WiredTiger and MMAPv1 engines respectively. The MMAPv1 installation also featured the MongoDB Genghis web-interface.

RethinkDB: One n1-standard-2 instance (2 vCPUs, 7.5 GB memory) running RethinkDB 2.1.

ArangoDB: One n1-highmem-2 instance (2 vCPUs, 13 GB memory) running ArangoDB 2.6. Only GFF3 data was loaded on this instance, because ArangoDB is an in-memory database.

Let’s take a complex and atypical data format, normalize it so that it works with modern databases, then run some database benchmarks.

Background: Finding viable data integration/scaling solutions for the upcoming BioHackathon 2015 in Nagasaki, Japan. CODAMONO (my business) will provide Google Cloud Compute Engine instances for hackathon attendees to play around with genomic data in JSON/JSON-LD format.

Data Source: Public genomic data (Mouse Genome Project, MGP; v5; insertions and deletions).

Databases:

  1. MongoDB by MongoDB Inc. (NoSQL database, document oriented)
  2. RethinkDB by RethinkDB (NoSQL database, document oriented)
  3. Virtuoso by OpenLink Software (triple store, triple oriented)

Databases that did not make the cut (both triple stores):

  1. 4store by Garlik – only supports the dated SPARQL 1.0 specification; development stopped in 2012; returned incorrect query results (COUNT queries)
  2. AllegroGraph by Franz Inc. – license forbids benchmarking; I inquired twice with Franz Inc. about getting permission and was explicitly told that benchmarking is not permitted

Environment: Google Cloud Compute Engine instance: n1-standard-1 (1 vCPU, 3.75 GB memory).

Results:

  1. The benchmark had to be scaled down to 1/100th of the data in order to be able to load it into Virtuoso; MongoDB and RethinkDB on their own can load the full dataset without breaking a sweat
  2. MongoDB outperforms both RethinkDB and Virtuoso (no joins)
  3. RethinkDB outperforms Virtuoso for the benchmarked join query
  4. RethinkDB with secondary indexes can be as fast as (or faster than) MongoDB without explicitly declared indexes
  5. RethinkDB takes up the least disk space to store the data, followed by MongoDB, Virtuoso comes last

Honorable Mention: I received great support from Daniel Mewes of RethinkDB when I had trouble with a particular query (Phred-scaled genotype likelihood distribution).

The Actual Benchmark

This benchmark focuses on bio-data, namely mouse genome data, but the results should carry over to other types of data too. The genomic data is encoded in an 18GB VCF file that contains information about 10 million genomic features.

For benchmarking with MongoDB and RethinkDB, the VCF file was converted to JSON-LD (a backwards compatible extension to JSON) using BioInterchange 2.0. For the full VCF data set, the JSON-LD grows to 164GB and the 10 million genomic features are normalized as 10 million documents. However, in order to make the benchmark work with Virtuoso, only 160MB of the VCF file could be taken into account, that is, 100,000 genomic features (100,000 JSON-LD documents, 1.4GB).
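To make the "one document per genomic feature" idea concrete, here is a minimal Python sketch that turns a single VCF data line into a JSON document. It is not BioInterchange's conversion: the output keys (`chromosome`, `position`, and so on) are hypothetical, the example record is made up, and the per-sample columns that a real MGP file carries are left out.

```python
import json

# Names of the first eight columns, as fixed by the VCF specification.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def vcf_line_to_document(line):
    """Turn one tab-separated VCF data line into a JSON-ready dict.

    The output keys are hypothetical and chosen for readability; they are
    not the vocabulary that BioInterchange emits.
    """
    fields = dict(zip(VCF_COLUMNS, line.rstrip("\n").split("\t")))
    info = {}
    for entry in fields["INFO"].split(";"):
        if entry == ".":  # "." marks an empty INFO column
            continue
        key, _, value = entry.partition("=")
        info[key] = value if value else True  # flags carry no value
    return {
        "chromosome": fields["CHROM"],
        "position": int(fields["POS"]),
        "reference": fields["REF"],
        "alternatives": fields["ALT"].split(","),
        "filters": fields["FILTER"].split(";"),
        "info": info,
    }


# A made-up indel record, not taken from the MGP file.
example = "1\t3000019\t.\tGA\tG\t112\tMinDP\tINDEL;DP=7"
document = vcf_line_to_document(example)
print(json.dumps(document, sort_keys=True))
```

Each line of the VCF file becomes one such document, which is why 10 million features end up as 10 million documents.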

Apparently Virtuoso can also process JSON-LD, but I could not find information on how to load JSON-LD documents without going through Virtuoso’s REST interface. In order to load the data nevertheless, the JSON-LD was converted to RDF N-Triples. Even though I used a very straightforward JSON-LD to RDF N-Triples conversion, the N-Triples size on disk blew up to 11GB and over 79 million triples. I estimate that the full data set would have exploded to over 900GB and 7 billion triples.
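Why the triple count grows so quickly is easier to see with a toy example. The sketch below is a naive flattening of a nested document into N-Triples lines. It is not the JSON-LD to RDF algorithm and not the conversion used for the benchmark: the `http://example.org/` namespace, the document and its keys are all placeholders.

```python
import itertools
import json

BASE = "http://example.org/"  # placeholder namespace, purely illustrative
_blank_ids = itertools.count()


def literal(value):
    """Serialize a scalar as an N-Triples literal."""
    if isinstance(value, bool):
        return '"%s"^^<http://www.w3.org/2001/XMLSchema#boolean>' % str(value).lower()
    if isinstance(value, int):
        return '"%d"^^<http://www.w3.org/2001/XMLSchema#integer>' % value
    # Everything else becomes a plain string literal; json.dumps yields a
    # double-quoted, backslash-escaped string.
    return json.dumps(str(value))


def flatten(subject, node):
    """Yield one triple per scalar value, plus one link per nested object."""
    for key, value in node.items():
        predicate = "<%s%s>" % (BASE, key)
        for item in value if isinstance(value, list) else [value]:
            if isinstance(item, dict):
                blank = "_:b%d" % next(_blank_ids)
                yield "%s %s %s ." % (subject, predicate, blank)
                yield from flatten(blank, item)
            else:
                yield "%s %s %s ." % (subject, predicate, literal(item))


# A made-up feature with two samples; real records carry far more of them.
document = {
    "position": 3000019,
    "filters": ["MinDP"],
    "samples": [
        {"name": "sample_a", "likelihoods": [0, 21, 255]},
        {"name": "sample_b", "likelihoods": [33, 0, 180]},
    ],
}
triples = list(flatten("<%sfeature/1>" % BASE, document))
print("\n".join(triples))
```

Every scalar turns into its own line that repeats the subject and the full predicate URI, and every nested object adds a linking triple on top. The numbers above tell the same story: 79 million triples for 100,000 documents is roughly 790 triples per document.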

Queries used in the benchmark:

  1. Count the genomic features that are annotated as having failed the “MinDP” filter.
  2. Calculate the distribution of Phred-scaled genotype likelihoods.
  3. Retrieve the genomic features within a region.
  4. Count the number of genomic positions in the MGP data that fall within mouse genes (mouse genes from Ensembl release 81; assembly difference not accounted for; 106,000 JSON-LD documents, 1.8 million triples)
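To pin down what the four queries compute, here they are as plain Python over a handful of in-memory documents. This is only a sketch of the query semantics, not the MongoDB, RethinkDB or SPARQL queries used in the benchmark. The field names, the `Qual` filter name and all values are made up for illustration.

```python
from collections import Counter

# Hypothetical, hand-made documents; real MGP records carry many more fields.
features = [
    {"chromosome": "1", "position": 100, "filters": ["MinDP"], "likelihoods": [0, 30, 255]},
    {"chromosome": "1", "position": 250, "filters": ["PASS"], "likelihoods": [30, 0, 120]},
    {"chromosome": "2", "position": 180, "filters": ["MinDP", "Qual"], "likelihoods": [0, 30, 90]},
]
genes = [
    {"chromosome": "1", "start": 200, "end": 300},
    {"chromosome": "2", "start": 500, "end": 900},
]

# Query 1: count the features that failed the "MinDP" filter.
min_dp_count = sum(1 for f in features if "MinDP" in f["filters"])

# Query 2: distribution of Phred-scaled genotype likelihoods,
# i.e. how often each likelihood value occurs across all features.
likelihood_distribution = Counter(pl for f in features for pl in f["likelihoods"])

# Query 3: features within a region (chromosome 1, positions 50 to 200).
in_region = [
    f for f in features
    if f["chromosome"] == "1" and 50 <= f["position"] <= 200
]


def within_any_gene(feature):
    """True if the feature's position lies inside at least one gene."""
    return any(
        g["chromosome"] == feature["chromosome"]
        and g["start"] <= feature["position"] <= g["end"]
        for g in genes
    )


# Query 4: count positions that fall within a gene (the join query).
within_genes = sum(1 for f in features if within_any_gene(f))

print(min_dp_count, dict(likelihood_distribution), len(in_region), within_genes)
```

Query 4 is the only one that needs a second collection, which is what makes it the join query of the benchmark.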

Result Table:


|  | Virtuoso 7.1.2 | MongoDB 3.0 | RethinkDB 2.1.0 |
|---|---|---|---|
| Data loading | 2-4h (6.2GB on disk) | 2min 46sec (4.3GB on disk) | 26-27min (1.5GB on disk) |
| Count “MinDP” features | 2-15sec (first); 2-3sec (uncached) | 0.7sec (first); 110-240msec (uncached) | 15sec (first); 7-8sec (w/o index); 170msec (index) |
| Phred-scaled likelihood distribution | 1min 6sec (first); 1min 6sec (cached) | 26-37sec (MapReduce) | 4min 25sec (first); 4min 5sec (cached) |
| Features in a region | 1.5sec (first); 390-600msec (uncached) | 211-216msec | 13sec (w/o index); 130-180msec (index) |
| Count features that fall within a gene | 4min-9min (loading); 2min 13sec (first); 2min 8sec (cached) | n/a | 1min 6sec (loading); 34sec (first, index); 9-15sec (cached) |

Note: “(first)” is the time of the very first query run; “(uncached)” refers to subsequent queries without the use of the database management system’s cache; “(cached)” is referring to a query where the cache was not cleared – used when queries appeared to perform slowly to see whether the cache would improve query times; “(w/o index)” marks query times in RethinkDB for which no secondary index was used; “(index)” marks query times in RethinkDB for which a secondary index was created and then used by the query; “(MapReduce)” denotes that a MapReduce algorithm was used.
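The gap between the “(w/o index)” and “(index)” timings comes down to scanning every document versus jumping straight to the matching range. The Python sketch below shows that idea in miniature with a sorted list and binary search. It is an illustration only: the positions are made up, and a real database index is an on-disk structure rather than a Python list.

```python
import bisect

# Hypothetical positions on a single chromosome, in arbitrary insertion order.
positions = [812, 15, 407, 993, 120, 560, 48, 731]


def region_scan(values, start, end):
    """Without an index: every value has to be inspected."""
    return sorted(v for v in values if start <= v <= end)


# A "secondary index" in miniature: the same values kept in sorted order.
index = sorted(positions)


def region_indexed(sorted_values, start, end):
    """With an index: two binary searches, then one contiguous slice."""
    lo = bisect.bisect_left(sorted_values, start)
    hi = bisect.bisect_right(sorted_values, end)
    return sorted_values[lo:hi]


# Both approaches return the same features; only the amount of work differs.
assert region_scan(positions, 100, 600) == region_indexed(index, 100, 600)
print(region_indexed(index, 100, 600))
```

The scan grows linearly with the number of documents, while the indexed lookup only needs a logarithmic number of comparisons plus the size of the result.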

Took a peek at the source of a web-page – as you do – and I somehow get the impression that The Guardian might be hiring:

<!DOCTYPE html>
<html id="js-context" class="js-off is-not-modern id--signed-out" data-page-path="/world/2015/aug/06/german-tv-presenter-anja-reschke-sparks-debate-support-refugees">
<!--

##::::: ##: ########::::::: ###:::: ########:: ########:::: ##:::: ##: ####: ########:: ####: ##::: ##:: ######::
##: ##: ##: ##.....::::::: ## ##::: ##.... ##: ##.....::::: ##:::: ##:. ##:: ##.... ##:. ##:: ###:: ##: ##... ##:
##: ##: ##: ##::::::::::: ##:. ##:: ##:::: ##: ##:::::::::: ##:::: ##:: ##:: ##:::: ##:: ##:: ####: ##: ##:::..::
##: ##: ##: ######:::::: ##:::. ##: ########:: ######:::::: #########:: ##:: ########::: ##:: ## ## ##: ##:: ####
##: ##: ##: ##...::::::: #########: ##.. ##::: ##...::::::: ##.... ##:: ##:: ##.. ##:::: ##:: ##. ####: ##::: ##:
##: ##: ##: ##:::::::::: ##.... ##: ##::. ##:: ##:::::::::: ##:::: ##:: ##:: ##::. ##::: ##:: ##:. ###: ##::: ##:
 ###. ###:: ########:::: ##:::: ##: ##:::. ##: ########:::: ##:::: ##: ####: ##:::. ##: ####: ##::. ##:. ######::

Ever thought about joining us?
http://developers.theguardian.com/join-the-team.html

-->
<head>
<meta charset="utf-8"/>