PGA: an R/Bioconductor package for identification of novel peptides using a customized database derived from RNA-Seq

Bo Wen; Shaohang Xu; Ruo Zhou; Bing Zhang; Xiaojing Wang; Xin Liu; Xun Xu; Siqi Liu

doi:10.1186/s12859-016-1133-3

BMC Bioinformatics. 2016 Jun 17;17(1):244. doi: 10.1186/s12859-016-1133-3.

Authors

Bo Wen¹, Shaohang Xu¹, Ruo Zhou¹, Bing Zhang², Xiaojing Wang², Xin Liu¹, Xun Xu¹, Siqi Liu³,⁴

Affiliations

¹ BGI-Shenzhen, Shenzhen, 518083, China.
² Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, 37232, USA.
³ BGI-Shenzhen, Shenzhen, 518083, China. siqiliu@genomics.cn.
⁴ Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China. siqiliu@genomics.cn.

Abstract

Background: Peptide identification based on mass spectrometry (MS) is generally achieved by comparing experimental mass spectra against theoretically digested peptides derived from a reference protein database. This strategy cannot identify peptide or protein sequences that are absent from the reference database. Customized protein databases built from RNA-Seq data have therefore been proposed to improve the identification of novel peptides. A comprehensive pipeline that provides an end-to-end solution for novel peptide detection with such a customized protein database is accordingly needed.
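The limitation described above can be illustrated with a minimal sketch. It is not PGA's implementation: the function name `tryptic_digest`, the toy protein sequences, and the simplified trypsin rule (cleave after K or R unless followed by P, no missed cleavages) are all assumptions made for illustration. It shows that a peptide carrying a single amino acid substitution appears only when the variant protein is added to the search database.

```python
import re


def tryptic_digest(protein, min_len=6):
    """In-silico digestion with a simplified trypsin rule (assumption):
    cleave after K or R unless the next residue is P; no missed cleavages."""
    fragments = re.split(r"(?<=[KR])(?!P)", protein)
    # Drop empty strings and peptides shorter than min_len.
    return {p for p in fragments if len(p) >= min_len}


# Toy sequences, invented for this sketch (not taken from any real dataset).
reference_protein = "MASGTLLKEDVAQLRSPEGTWIDKALLNPR"
# Hypothetical RNA-Seq-derived variant: A -> T substitution in the second peptide.
variant_protein = "MASGTLLKEDVTQLRSPEGTWIDKALLNPR"

reference_peptides = tryptic_digest(reference_protein)
# The customized database is the reference plus the variant-derived sequence.
customized_peptides = reference_peptides | tryptic_digest(variant_protein)

# Peptides that can only be matched when the customized database is searched.
novel_candidates = customized_peptides - reference_peptides
print(sorted(novel_candidates))  # ['EDVTQLR']
```

A search restricted to `reference_peptides` has no entry for the variant peptide, so a spectrum produced by it would go unmatched or be mismatched; the customized set is what makes it identifiable.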

Results: A pipeline implemented as an R package, named PGA, was developed. It automates the processing of tandem mass spectrometry (MS/MS) data acquired on different MS platforms and the construction of customized protein databases from RNA-Seq data, with or without a reference genome as a guide. PGA identifies novel peptides and generates an HTML-based report with a visual interface. Applied to a published dataset, PGA identified 636 novel peptides: 510 single amino acid polymorphism (SAP) peptides, 2 INDEL peptides, 49 splice junction peptides, and 75 novel transcript-derived peptides. The software is freely available from http://bioconductor.org/packages/PGA/ , and example reports are available at http://wenbostar.github.io/PGA/ .

Conclusions: PGA, designed to be platform-independent and easy to use, was successfully developed and shown to be capable of identifying novel peptides by searching customized protein databases derived from RNA-Seq data.

Keywords: MS/MS; Peptide identification; Proteogenomics; Proteomics; RNA-Seq.

MeSH terms

Databases, Protein
Humans
Peptides / isolation & purification*
Proteomics / methods*
Sequence Analysis, RNA*
Software*
Tandem Mass Spectrometry / methods*

Substances

Peptides