Fragment assignment in the cloud with eXpress-D

Adam Roberts; Harvey Feng; Lior Pachter

doi:10.1186/1471-2105-14-358

BMC Bioinformatics. 2013 Dec 7;14:358. doi: 10.1186/1471-2105-14-358.

Authors

Adam Roberts, Harvey Feng, Lior Pachter¹

Affiliation

¹ Department of Computer Science, 387 Soda Hall, UC Berkeley, Berkeley, CA 94720, USA. lpachter@math.berkeley.edu.

Abstract

Background: Probabilistic assignment of ambiguously mapped fragments produced by high-throughput sequencing experiments has been demonstrated to greatly improve accuracy in the analysis of RNA-Seq and ChIP-Seq, and is an essential step in many other sequence census experiments. A maximum likelihood method using the expectation-maximization (EM) algorithm for optimization is commonly used to solve this problem. However, batch EM-based approaches do not scale well with the size of sequencing datasets, which have been increasing dramatically over the past few years. Thus, current approaches to fragment assignment rely on heuristics or approximations for tractability.

Results: We present an implementation of a distributed EM solution to the fragment assignment problem using Spark, a data analytics framework that can scale by leveraging compute clusters within datacenters ("the cloud"). We demonstrate that our implementation easily scales to billions of sequenced fragments while providing the exact maximum likelihood assignment of ambiguous fragments. The method is shown to be more accurate than the most widely used tools available, and it runs in a constant amount of time when cluster resources are scaled linearly with the amount of input data.
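
As an illustration of why EM for fragment assignment distributes naturally, the following minimal sketch runs batch EM on a toy model in plain Python. It is not the eXpress-D code and does not use the Spark API. It assumes a deliberately simplified model in which each fragment is equally compatible with every target it maps to, ignoring fragment lengths, alignment likelihoods, and bias terms. The function name `em_fragment_assignment` and its parameters are hypothetical. The E-step is independent per fragment (a "map"), and the M-step is a sum over fragments (a "reduce").

```python
def em_fragment_assignment(fragments, num_targets, iterations=100):
    """Estimate relative target abundances by batch EM (simplified sketch).

    fragments: list of lists; each inner list holds the indices of the
               targets that one fragment maps to (assumed non-empty).
    Returns a list of abundances that sums to 1.
    """
    # Start from uniform abundances.
    theta = [1.0 / num_targets] * num_targets

    for _ in range(iterations):
        # E-step ("map"): each fragment independently splits one unit of
        # probability mass across its compatible targets, in proportion
        # to the current abundance estimates.
        partial_counts = []
        for targets in fragments:
            total = sum(theta[t] for t in targets)
            partial_counts.append({t: theta[t] / total for t in targets})

        # M-step ("reduce"): sum the fractional counts per target and
        # renormalize by the number of fragments.
        counts = [0.0] * num_targets
        for assignment in partial_counts:
            for t, weight in assignment.items():
                counts[t] += weight
        theta = [c / len(fragments) for c in counts]

    return theta


if __name__ == "__main__":
    # Two fragments map uniquely to target 0; one is ambiguous between 0 and 1.
    toy_fragments = [[0], [0], [0, 1]]
    print(em_fragment_assignment(toy_fragments, num_targets=2))
```

In a distributed setting, the per-fragment E-step could be spread across workers and the per-target sums combined afterward, which is the general pattern a framework such as Spark is designed for. The details of how eXpress-D does this are not given in the abstract.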

Conclusions: The cloud offers one solution for the difficulties faced in the analysis of massive high-throughput sequencing data, which continue to grow rapidly. Researchers in bioinformatics must follow developments in distributed systems, such as new frameworks like Spark, for ways to port existing methods to the cloud and help them scale to the datasets of the future. Our software, eXpress-D, is freely available at: http://github.com/adarob/express-d.

Publication types

Research Support, N.I.H., Extramural
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms
Artificial Intelligence
Computational Biology / methods
Computer Communication Networks
Feasibility Studies
High-Throughput Nucleotide Sequencing / methods*
Humans
Likelihood Functions
Oligonucleotide Array Sequence Analysis / methods
Peptide Fragments / genetics
Probability
Programming Languages
Search Engine
Sequence Alignment
Software*
Transcriptome / genetics*
User-Computer Interface

Substances

Peptide Fragments

Grants and funding

HG006129/HG/NHGRI NIH HHS/United States