Deep coverage of the Escherichia coli proteome enables the assessment of false discovery rates in simple proteogenomic experiments

Mol Cell Proteomics. 2013 Nov;12(11):3420-30. doi: 10.1074/mcp.M113.029165. Epub 2013 Aug 1.

Abstract

Recent advances in mass spectrometry (MS) have led to increased applications of shotgun proteomics to the refinement of genome annotation. The typical "proteo-genomic" workflows rely on the mapping of peptide MS/MS spectra onto databases derived via six-frame translation of the genome sequence. These databases contain a large proportion of spurious protein sequences which make the statistical confidence of the resulting peptide spectrum matches difficult to assess. Here we performed a comprehensive analysis of the Escherichia coli proteome using LTQ-Orbitrap MS and mapped the corresponding MS/MS spectra onto a six-frame translation of the E. coli genome. We hypothesized that the protein-coding part of the E. coli genome approaches complete annotation and that the majority of six frame-specific (novel) peptide spectrum matches can be considered as false positive identifications. We confirm our hypothesis by showing that the posterior error probability distribution of novel hits is almost identical to that of reversed (decoy) hits; this enables us to estimate the sensitivity, specificity, accuracy, and false discovery rate in a typical bacterial proteo-genomic dataset. We use two complementary computational frameworks for processing and statistical assessment of MS/MS data: MaxQuant and Trans-Proteomic Pipeline. We show that MaxQuant achieves a more sensitive six-frame database search with an acceptable false discovery rate and is therefore well suited for global genome reannotation applications, whereas the Trans-Proteomic Pipeline achieves higher specificity and is well suited for high-confidence validation. The use of a small and well-annotated bacterial genome enables us to address genome coverage achieved in state-of-the-art bacterial proteomics: identified peptide sequences mapped to all expressed E. coli proteins but covered 31.7% of the protein-coding genome sequence. Our results show that false discovery rates can be substantially underestimated even in "simple" proteo-genomic experiments obtained by means of high-accuracy MS and point to the necessity of further improvements concerning the coverage of peptide sequences by MS-based methods.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Databases, Genetic
  • Databases, Protein
  • Escherichia coli K12 / genetics
  • Escherichia coli K12 / metabolism
  • Escherichia coli Proteins / genetics*
  • Escherichia coli Proteins / metabolism*
  • Genome, Bacterial
  • Genomics / methods*
  • Genomics / statistics & numerical data
  • Proteome / genetics
  • Proteome / metabolism
  • Proteomics / methods*
  • Proteomics / statistics & numerical data
  • Tandem Mass Spectrometry / statistics & numerical data

Substances

  • Escherichia coli Proteins
  • Proteome