Mol Cell Proteomics. 2011 Dec;10(12):M111.007690.
doi: 10.1074/mcp.M111.007690. Epub 2011 Aug 29.

iProphet: multi-level integrative analysis of shotgun proteomic data improves peptide and protein identification rates and error estimates

David Shteynberg et al. Mol Cell Proteomics. 2011 Dec.

Abstract

The combination of tandem mass spectrometry and sequence database searching is the method of choice for the identification of peptides and the mapping of proteomes. Over the last several years, the volume of data generated in proteomic studies has increased dramatically, which challenges the computational approaches previously developed for these data. Furthermore, a multitude of search engines have been developed that identify different, overlapping subsets of the sample peptides from a particular set of tandem mass spectrometry spectra. We present iProphet, a new addition to the Trans-Proteomics Pipeline, the widely used open-source suite of proteomic data analysis tools. Applied in tandem with PeptideProphet, it provides a more accurate representation of the multilevel nature of shotgun proteomic data. iProphet combines the evidence from multiple identifications of the same peptide sequences across different spectra, experiments, precursor ion charge states, and modified states. It also allows accurate and effective integration of the results from multiple database search engines applied to the same data. The use of iProphet in the Trans-Proteomics Pipeline increases the number of correctly identified peptides at a constant false discovery rate as compared with both PeptideProphet and another state-of-the-art tool, Percolator. As the main outcome, iProphet permits the calculation of accurate posterior probabilities and false discovery rate estimates at the level of sequence-identical peptide identifications, which in turn leads to more accurate probability estimates at the protein level. Fully integrated with the Trans-Proteomics Pipeline, it supports all commonly used MS instruments, search engines, and computer platforms. The performance of iProphet is demonstrated on two publicly available data sets: data from a human whole cell lysate proteome profiling experiment representative of typical proteomic data sets, and data from a set of Streptococcus pyogenes experiments more representative of organism-specific composite data sets.

Figures

Fig. 1.
Overview of shotgun proteomic data and the computational strategy. The protein sample is digested into peptides, with some peptides present in both unmodified and modified (e.g. oxidized methionine) forms. The peptide sample is separated using liquid chromatography (LC) coupled online with a tandem mass spectrometer. The first stage of MS measures mass-to-charge ratios of peptide ions injected into the instrument at any given time. A peptide can be ionized into multiple peptide precursor ions having different charge states (e.g. 2+ and 3+). Selected peptide ions are subjected to MS/MS sequencing (some multiple times). Each acquired MS/MS spectrum is assigned a best matching peptide sequence using sequence database searching. When multiple search tools are applied in parallel (Search 1 and Search 2), each spectrum produces multiple peptide-to-spectrum matches (search-specific PSM level), which could be the same or different peptides (summarized at the PSM level). Within the same LC-MS/MS run, the same peptide ion can be identified from multiple PSMs (run-specific peptide ion level). The experiment may consist of several LC-MS/MS analyses (Analysis 1 and 2), in which case the same peptide ion can be identified in multiple runs (peptide ion level). Considering the modification status, the same peptide can be identified in multiple forms (modification-specific peptide level), which are then further collapsed into a single identification at the unique peptide sequence level. Multiple unique peptide sequences may correspond to the same protein (protein level). PeptideProphet calculates the posterior probability of a correct PSM, individually for each search engine output. iProphet combines multiple lines of evidence and computes accurate probabilities at the level of unique peptide sequences, assisted by the introduction of new grouping variables: NSS, NRS, NSE, NSI, and NSM. ProteinProphet combines peptide probabilities to compute the protein probability (with an additional adjustment for NSP).
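The nesting of levels described in this legend can be illustrated with a small grouping sketch. This is only an illustration under stated assumptions, not the iProphet implementation: the `PSM` record, its field names, and the key chosen for each level are hypothetical, and the code merely counts distinct entries per level rather than computing any probabilities.

```python
from typing import NamedTuple

class PSM(NamedTuple):
    # Hypothetical record layout, chosen for illustration only.
    run: str        # LC-MS/MS run (analysis) identifier
    spectrum: str   # spectrum identifier within the run
    engine: str     # search engine that produced the match
    sequence: str   # peptide sequence with modifications stripped
    modified: str   # peptide sequence including modification annotations
    charge: int     # precursor ion charge state
    prob: float     # PSM probability, e.g. from PeptideProphet

# Assumed grouping key for each level named in the legend, finest to coarsest.
LEVEL_KEYS = {
    "search-specific PSM": lambda p: (p.run, p.spectrum, p.engine),
    "PSM": lambda p: (p.run, p.spectrum),
    "run-specific peptide ion": lambda p: (p.run, p.modified, p.charge),
    "peptide ion": lambda p: (p.modified, p.charge),
    "modification-specific peptide": lambda p: (p.modified,),
    "unique peptide sequence": lambda p: (p.sequence,),
}

def entries_per_level(psms):
    """Count the distinct identifications that remain at each level."""
    return {level: len({key(p) for p in psms}) for level, key in LEVEL_KEYS.items()}
```

Applied to a handful of matches, the counts shrink as identifications are collapsed from the search-specific PSM level toward the unique peptide sequence level, which is the multilevel structure the figure depicts.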
Fig. 2.
Discriminating power of computed probabilities. A, The number of correct PSMs as a function of FDR obtained using iProphet (solid blue line) and PeptideProphet (green dashes). Human data set, X! Tandem search. B, Same as (A), at the protein level, after application of ProteinProphet. C, The number of correct PSMs as a function of FDR obtained using iProphet when analyzing individual search engine results (six search engines listed in the box), and all search engines combined (solid blue curve). D, Same as (C), at the protein level, after application of ProteinProphet.
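Curves of this kind (number of correct identifications against FDR) are commonly built from a target-decoy search. The sketch below assumes one simple convention, FDR estimated as decoys divided by targets above a threshold and correct identifications estimated as targets minus decoys; the legend does not state the exact estimator used for the figure, so treat both formulas and the function name `decoy_fdr_curve` as assumptions.

```python
def decoy_fdr_curve(scored):
    """Build (estimated FDR, estimated number correct) points from decoy counts.

    scored: iterable of (probability, is_decoy) pairs, one per identification.
    Identifications are accepted in order of decreasing probability.
    """
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    targets = 0
    decoys = 0
    curve = []
    for _prob, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets == 0:
            continue  # no target accepted yet, the ratio is undefined
        fdr = min(1.0, decoys / targets)       # assumed estimator: decoys / targets
        n_correct = max(0, targets - decoys)   # assumed estimator: targets - decoys
        curve.append((fdr, n_correct))
    return curve
```

A method with better discriminating power reaches a larger estimated number of correct identifications at the same FDR value, which is what the comparison between the plotted curves expresses.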
Fig. 3.
Accuracy of probability-based FDR estimates. FDR estimated using probabilities computed by the iProphet model (solid blue line) and by PeptideProphet (green dashes) plotted as a function of FDR estimated using decoys. A perfect agreement between the two methods (probability-based and decoy-based) is indicated by a 45-degree dotted line. A, X! Tandem, PSM level. B, X! Tandem, unique peptide sequence level. C, X! Tandem, protein level. D, All six search engines combined using iProphet with NSS model enabled (solid blue line), or simply by selecting the identification having the highest PeptideProphet probability across the individual search results (“naïve combination”). FDR estimated at the PSM level. E, All search engines combined using iProphet or using the naïve approach, unique peptide sequence level. F, All search engines combined using iProphet or using the naïve approach, protein level (after application of ProteinProphet).
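Two quantities in this legend lend themselves to a short sketch. A probability-based FDR estimate can be taken as the mean of (1 - p) over the accepted identifications, and the "naïve combination" keeps, for each spectrum, the identification with the highest probability across the individual searches. Both functions below are minimal illustrations: the function names and the dictionary layout are hypothetical, and the probability-based estimator is the commonly used form, assumed here rather than quoted from the article.

```python
def probability_based_fdr(probs, threshold):
    """Estimate FDR among identifications with probability >= threshold.

    Assumes each probability is the chance that the identification is correct,
    so the expected number of errors is the sum of (1 - p).
    """
    accepted = [p for p in probs if p >= threshold]
    if not accepted:
        return 0.0
    return sum(1.0 - p for p in accepted) / len(accepted)

def naive_combination(results_by_engine):
    """Keep the highest-probability identification per spectrum.

    results_by_engine: {engine_name: {spectrum_id: (peptide, probability)}}
    (hypothetical layout). Returns {spectrum_id: (peptide, probability)}.
    """
    best = {}
    for results in results_by_engine.values():
        for spectrum_id, (peptide, prob) in results.items():
            if spectrum_id not in best or prob > best[spectrum_id][1]:
                best[spectrum_id] = (peptide, prob)
    return best
```

Plotting the probability-based estimate against a decoy-based estimate at matching thresholds gives the kind of comparison shown in the panels, where points on the 45-degree line indicate agreement.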
Fig. 4.
Contribution of different models in iProphet. A, The number of correct PSMs as a function of FDR obtained using PeptideProphet (green dashes), iProphet (solid blue line), and using iProphet with only a single model enabled: NSM, NRS, NSE, or NSI. B, The distributions of the number of sibling ions (NSI) statistic among incorrect (red) and correct (blue) identifications. The shaded areas represent the actual distributions observed, P(NSI|−) and P(NSI|+), labeled as negative (N) and positive (P), respectively. The red and blue solid lines show the iProphet modeled distributions. The solid black curve represents the natural log of the ratio P(NSI|+)/P(NSI|−). When the ratio of the distributions is above 1 (0 on the log scale, indicated by the dotted horizontal line), the model boosts the probability of a PSM having an NSI value in that range, and it reduces the probability in the range of NSI values where the ratio drops below 1. S. pyogenes data set, SEQUEST search.
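The boosting and reducing behaviour described for panel B follows from Bayes' rule applied with a likelihood ratio. The sketch below is a generic single-variable update, offered as an assumption about the general mechanism rather than as the exact formula iProphet uses to combine its five models; the function names are hypothetical.

```python
import math

def log_likelihood_ratio(p_given_pos, p_given_neg):
    """Natural log of P(x|+)/P(x|-), the quantity drawn as the black curve."""
    return math.log(p_given_pos / p_given_neg)

def adjust_probability(prior, p_given_pos, p_given_neg):
    """Update a prior probability of correctness using one grouping statistic.

    prior:        probability before the update (e.g. from PeptideProphet)
    p_given_pos:  P(x|+), density of the observed value among correct matches
    p_given_neg:  P(x|-), density of the observed value among incorrect matches
    """
    ratio = p_given_pos / p_given_neg
    return prior * ratio / (prior * ratio + (1.0 - prior))
```

With a prior of 0.5, a ratio of 2 raises the probability to about 0.67 and a ratio of 0.5 lowers it to about 0.33, while a ratio of 1 (log ratio 0) leaves it unchanged.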
Fig. 5.
The distributions of grouping statistics learned by iProphet. The negative (red) and positive (blue) distributions for all five grouping variables used in iProphet. See the Fig. 4B legend for details. A, Number of replicate spectra, NRS. B, Number of sibling searches, NSS. C, Number of sibling ions, NSI. D, Number of sibling experiments, NSE. E, Number of sibling modifications, NSM. S. pyogenes data set, all search engines combined.
Fig. 6.
Comparison between iProphet, PeptideProphet, and Percolator. The number of correct PSMs as a function of FDR obtained using iProphet (solid blue line), PeptideProphet (green dashes), and Percolator (purple, dash dot), applied to SEQUEST search results. Inset shows an extended range of FDR values (up to 20%). A, Human data set. B, FFE-LTQ-FT subset of the S. pyogenes data set.
Fig. 7.
Overview of a possible TPP workflow. Analysis with iProphet can be performed as an intermediate step between PeptideProphet and ProteinProphet, either in a single search analysis or in a combined search analysis.
