Analyzing large scale genomic data on the cloud with Sparkhit

Bioinformatics. 2018 May 1;34(9):1457-1465. doi: 10.1093/bioinformatics/btx808.

Abstract

Motivation: The increasing amount of next-generation sequencing data poses a fundamental challenge to large-scale genomic analytics. Existing tools use different distributed computing platforms to scale out bioinformatics workloads. However, these tools do not scale efficiently, and they incur heavy runtime overheads when pre-processing large amounts of data. To address these limitations, we have developed Sparkhit: a distributed bioinformatics framework built on top of the Apache Spark platform.

Results: Sparkhit integrates a variety of analytical methods. It is implemented using Spark's extended MapReduce model. It runs 92-157 times faster than MetaSpark on metagenomic fragment recruitment and 18-32 times faster than Crossbow on data pre-processing. We analyzed 100 terabytes of data across four genomic projects in the cloud in 21 h, including the time for cluster deployment and data downloading. Furthermore, our analysis of the entire Human Microbiome Project shotgun sequencing dataset completed in 2 h, demonstrating an approach for easily associating large public datasets with reference data.
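To illustrate the MapReduce pattern underlying tasks like fragment recruitment, the following is a minimal, hypothetical sketch in plain Python (not Sparkhit's actual API): a map step emits per-reference seed hits for each read via a k-mer index, and a reduce step merges the counts. All names and the toy data are assumptions for illustration only.

```python
from collections import Counter
from functools import reduce

def kmers(seq, k=4):
    """Enumerate overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def map_read(read, ref_index):
    """Map step: count seed k-mer hits per reference for one read."""
    hits = Counter()
    for km in kmers(read):
        for ref_id in ref_index.get(km, ()):
            hits[ref_id] += 1
    return hits

def reduce_counts(a, b):
    """Reduce step: merge hit counts from two partial results."""
    a.update(b)
    return a

# Toy k-mer index (k-mer -> reference ids) and reads; hypothetical data.
ref_index = {"ACGT": ["ref1"], "CGTA": ["ref1", "ref2"]}
reads = ["ACGTA", "CGTAC"]

total = reduce(reduce_counts, (map_read(r, ref_index) for r in reads), Counter())
print(total)  # per-reference seed hit counts across all reads
```

In a Spark setting, the map step would run in parallel over partitions of the read data and the reduce step would be a distributed aggregation; the structure of the computation is the same.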

Availability and implementation: Sparkhit is freely available at: https://rhinempi.github.io/sparkhit/.

Contact: asczyrba@cebitec.uni-bielefeld.de.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • High-Throughput Nucleotide Sequencing / methods*
  • Humans
  • Metagenomics / methods*
  • Microbiota / genetics
  • Sequence Analysis, DNA / methods
  • Software*