SPA: a short peptide assembler for metagenomic data

Youngik Yang; Shibu Yooseph

doi:10.1093/nar/gkt118

Nucleic Acids Res. 2013 Apr;41(8):e91. doi: 10.1093/nar/gkt118. Epub 2013 Feb 23.

Authors

Youngik Yang¹, Shibu Yooseph

Affiliation

¹ Informatics Department, J. Craig Venter Institute, San Diego, CA 92121, USA.

Abstract

The metagenomic paradigm allows for an understanding of the metabolic and functional potential of microbes in a community via a study of their proteins. The substrate for protein identification is either the set of individual nucleotide reads generated from metagenomic samples or the set of contig sequences produced by assembling these reads. However, a read-based strategy using reads generated by next-generation sequencing (NGS) technologies results in an overwhelming majority of partial-length protein predictions. A nucleotide assembly-based strategy does not fare much better, as metagenomic assemblies are typically fragmented and also leave a large fraction of reads unassembled. Here, we present a method for reconstructing complete protein sequences directly from NGS metagenomic data. Our framework is based on a novel short peptide assembler (SPA) that assembles protein sequences from their constituent peptide fragments identified on short reads. The SPA algorithm is based on informed traversals of a de Bruijn graph, defined on an amino acid alphabet, to identify probable paths that correspond to proteins. Using large simulated and real metagenomic data sets, we show that our method outperforms the alternate approach of identifying genes on nucleotide sequence assemblies and generates longer protein sequences that can be more effectively analysed.
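
The idea of a de Bruijn graph defined on an amino acid alphabet can be illustrated with a small sketch. This is not the SPA implementation: the function names (`build_graph`, `greedy_path`, `assemble`), the k-mer size, and the toy peptides are all hypothetical, and a simple greedy walk that follows the most frequently observed edge stands in for the informed traversals described in the abstract.

```python
from collections import defaultdict


def build_graph(peptides, k):
    """Nodes are amino acid k-mers; an edge joins k-mers adjacent in a peptide."""
    edges = defaultdict(lambda: defaultdict(int))
    nodes = set()
    for pep in peptides:
        kmers = [pep[i:i + k] for i in range(len(pep) - k + 1)]
        nodes.update(kmers)
        for a, b in zip(kmers, kmers[1:]):
            edges[a][b] += 1  # edge weight = number of times the overlap was seen
    return nodes, edges


def greedy_path(start, edges):
    """Walk from `start`, always taking the heaviest edge to an unvisited node.

    Assumption: this greedy rule is a placeholder, not the authors' traversal.
    """
    path = [start]
    seen = {start}
    node = start
    while True:
        candidates = [
            (count, nxt)
            for nxt, count in edges.get(node, {}).items()
            if nxt not in seen
        ]
        if not candidates:
            break
        _, node = max(candidates)
        path.append(node)
        seen.add(node)
    # Spell the path: the first k-mer plus the last residue of each later k-mer.
    return path[0] + "".join(n[-1] for n in path[1:])


def assemble(peptides, k=4):
    """Return one assembled sequence per k-mer that has no incoming edge."""
    nodes, edges = build_graph(peptides, k)
    has_incoming = {b for successors in edges.values() for b in successors}
    starts = sorted(nodes - has_incoming)
    return [greedy_path(s, edges) for s in starts]


if __name__ == "__main__":
    # Three overlapping, made-up peptide fragments of a single toy protein.
    fragments = ["MKTAYIAK", "AYIAKQRQ", "KQRQISFV"]
    print(assemble(fragments, k=4))
```

In this toy input the three fragments overlap by at least four residues, so the walk joins them into a single sequence. Real metagenomic data would contain branching nodes from related proteins and sequencing errors, which is where a more informed path selection than this greedy rule becomes necessary.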

Publication types

Evaluation Study
Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
High-Throughput Nucleotide Sequencing
Metagenomics / methods*
Peptides / chemistry
Sensitivity and Specificity
Sequence Analysis, Protein / methods*

Substances

Peptides
