Enabling large-scale next-generation sequence assembly with Blacklight

M Brian Couger; Lenore Pipes; Fabio Squina; Rolf Prade; Adam Siepel; Robert Palermo; Michael G Katze; Christopher E Mason; Philip D Blood

doi:10.1002/cpe.3231

Concurr Comput. 2014 Sep 10;26(13):2157-2166. doi: 10.1002/cpe.3231.

Authors

M Brian Couger¹, Lenore Pipes², Fabio Squina³, Rolf Prade¹, Adam Siepel², Robert Palermo⁴, Michael G Katze⁴, Christopher E Mason⁵, Philip D Blood⁶

Affiliations

¹ Department of Microbiology and Molecular Genetics, Oklahoma State University, 1110 South Innovation Way, Stillwater, OK, 74078 USA.
² Department of Biological Statistics and Computational Biology, Weill Hall, Cornell University, Ithaca, NY, 14850 USA.
³ Laboratório Nacional de Ciência e Tecnologia do Bioetanol, Centro Nacional de Pesquisa em Energia e Materiais, Campinas-SP, 13083-970, Brazil.
⁴ Department of Microbiology, University of Washington, Seattle, WA, 98109 USA.
⁵ Department of Physiology and Biophysics, Weill Cornell Medical College, New York, NY, USA.
⁶ Pittsburgh Supercomputing Center, Carnegie Mellon University, 300 S. Craig St., Pittsburgh, PA, 15213 USA.

Abstract

A variety of extremely challenging biological sequence analyses were conducted on the XSEDE large shared memory resource Blacklight, using current bioinformatics tools and encompassing a wide range of scientific applications. These included genomic sequence assembly, very large metagenomic sequence assembly, transcriptome assembly, and sequencing error correction. The data sets used in these analyses included uncategorized fungal species, reference microbial data, very large soil and human gut microbiome sequence data, and primate transcriptomes, composed of both short-read and long-read sequence data. A new parallel command execution program was developed on the Blacklight resource to handle some of these analyses. These results, first reported at XSEDE13 and expanded here, represent significant advances for their respective scientific communities. The breadth and depth of the results achieved demonstrate the ease of use, versatility, and unique capabilities of the Blacklight XSEDE resource for scientific analysis of genomic and transcriptomic sequence data, and the power of these resources, together with XSEDE support, in meeting the most challenging scientific problems.

Keywords: NGS; RNA-seq; bioinformatics; data-intensive computing; de novo assembly; genome; genomics; high-performance computing; large shared memory computing; metagenome; primates; transcriptome.
