Mol Cell Proteomics. 2018 Nov;17(11):2270-2283.
doi: 10.1074/mcp.TIR118.000850. Epub 2018 Aug 9.

gpGrouper: A Peptide Grouping Algorithm for Gene-Centric Inference and Quantitation of Bottom-Up Proteomics Data

Alexander B Saltzman et al. Mol Cell Proteomics. 2018 Nov.

Abstract

In quantitative mass spectrometry, the method by which peptides are grouped into proteins can have dramatic effects on downstream analyses. Here we describe gpGrouper, an inference and quantitation algorithm that offers an alternative method for assignment of protein groups by gene locus and improves pseudo-absolute iBAQ quantitation by weighted distribution of shared peptide areas. We experimentally show that distributing shared peptide quantities based on unique peptide peak ratios improves quantitation accuracy compared with conventional winner-take-all scenarios. Furthermore, gpGrouper seamlessly handles two-species samples such as patient-derived xenografts (PDXs) without ignoring the host species or species-shared peptides. This is a critical capability for proper evaluation of proteomics data from PDX samples, where stromal infiltration varies across individual tumors. Finally, gpGrouper calculates peptide peak area (MS1) based expression estimates from multiplexed isobaric data, producing iBAQ results that are directly comparable across label-free, isotopic, and isobaric proteomics approaches.

Keywords: Bioinformatics software; Cancer Biology; Label-free quantification; Mass Spectrometry; Mouse models; Quantification; iTRAQ; patient derived xenograft; protein inference; shared peptides.

Figures

Graphical abstract
Fig. 1.
Gene-centric grouping is a robust method for inference and quantitation of gene product expression in single and mixed species samples. A, Proportions of distinguishable proteins in HeLa (human, left) and HeLa/3T3 mixture (human/mouse, right) proteome profiling data. A protein can be inferred from the identified peptide pool in ∼5% of cases. An additional 6–10% of unique protein assignments are from trivial cases where only one possible protein isoform is annotated for a given gene product. B, Proportions of protein isoforms that are distinguishable at the gene product level in human and human/mouse proteome profiling data. The majority of protein groups map to a single gene locus, and peptide coverage is insufficient to definitively identify an isoform in the majority of these cases. C, Comparison of human and mouse calumenin protein inference, peptide assignments, and quantitation in the human/mouse mixture sample. The expected quantities are calculated from corresponding profiling of separate HeLa or 3T3 lysates. The results from the mixed sample were assembled via the gene-centric approach by gpGrouper and the protein-centric approach by MaxQuant (without cross-species peptide elimination). Razor peptides are assigned by the winner-take-all method, and the whole quantity of the razor peptide is used in quantitation of its corresponding protein-centric group. Note that ProteinGI 41282022, which is definitively identified in separate 3T3 cell profiling, is parsimoniously eliminated by protein-centric grouping in mixed species data.
Fig. 2.
Qualitative binning as an alternative to subset elimination. A, Definition parameters for Strict, Relaxed, and All (SRA) qualities of gene product identifications. B, Definition rules for IDSet classes of gene products. This three-tiered annotation system indicates whether gene product identifications are based on peptides with unambiguous gene locus mapping and demarcates identifications with subset peptide evidence. C, Definition parameters for PSM IDGroup bins. The lowest PSM IDGroup from peptides mapped to a given gene product is assigned as the gene-level IDGroup. D, Exploratory analysis of gpGrouper identification results from profiling of two basal and two luminal PDX tumors. Examples of gpGrouper metrics for (1) a subset of well-characterized PAM50 luminal markers, (2) a selection of gene products with borderline identifications in basal cancers, and (3) a selection of low abundance gene products with consistent expression quantities across luminal tumors, but variable spectral match qualities.
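The gene-level IDGroup rule in panel C can be illustrated with a minimal sketch. It assumes that PSM IDGroups are integers where a lower number means a better bin; the function name, the input layout, and the gene symbols are hypothetical and are not taken from gpGrouper itself.

```python
def gene_level_idgroup(psm_idgroups):
    """Return the lowest PSM IDGroup observed for each gene product.

    psm_idgroups: iterable of (gene, idgroup) pairs, one per PSM.
    Assumption: IDGroups are integers and lower means higher quality.
    """
    best = {}
    for gene, idgroup in psm_idgroups:
        # Keep the smallest IDGroup seen so far for this gene product.
        if gene not in best or idgroup < best[gene]:
            best[gene] = idgroup
    return best


# Hypothetical PSM table: three PSMs for one gene product, one for another.
psms = [("ESR1", 3), ("ESR1", 1), ("ESR1", 5), ("GATA3", 4)]
gene_groups = gene_level_idgroup(psms)
```

Under this rule a single high-quality PSM is enough to place a gene product in the best bin, which is why the caption pairs the IDGroup with the separate SRA and IDSet annotations.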
Fig. 3.
Validation of distribution algorithm for peak areas of peptides shared across multiple gene products. A, Experimental approach for benchmarking the accuracy of splitting shared peptides by unique peptide ratios. A specialized 1:1 mixture of human TMT126-tagged peptides and mouse TMT129-tagged peptides was made. A given peptide mapping to both species will elute as a single MS1 peak. The AUC value of said peak can be split by relative reporter ion ratios from its SPS-MS3 spectrum to determine expected AUC distribution. B, Theoretical scenario by which a shared peptide is split across two gene products (one mouse and one human gene in this dataset), which also have unique-to-gene peptides. The calculation of the distributed peptide area sum for each gene product comprises the sum of unique peptides and shared peptides after weighting by the unique peptide ratio. C–D, Analyses of peptides that are shared across species and map to genes that also contain one or more unique peptides. C, Correlation plot for 5,590 shared peptide quantities distributed according to unique peptide ratios (gpGrouper AUC) versus expected quantities measured by TMT reporter ion ratios. D, The histogram of differences between expected and gpGrouper estimated quantities for distributed areas of shared peptides.
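The weighting described in panel B can be sketched as follows. This is an illustrative reading of the caption, not gpGrouper's actual code: the function name and data layout are hypothetical, and the equal-split fallback for shared peptides whose genes have no unique evidence is an assumption that the caption does not address.

```python
def distribute_shared_areas(unique_area, shared_peptides):
    """Sum unique and ratio-weighted shared peptide areas per gene product.

    unique_area: dict mapping gene -> summed peak area of its unique peptides.
    shared_peptides: list of (area, [genes]) for peptides mapping to >1 gene.
    """
    totals = dict(unique_area)
    for area, genes in shared_peptides:
        weights = [unique_area.get(g, 0.0) for g in genes]
        denom = sum(weights)
        for gene, weight in zip(genes, weights):
            if denom > 0:
                # Weight the shared area by each gene's unique-peptide ratio.
                share = area * (weight / denom)
            else:
                # Assumption: equal split when no gene has unique peptides.
                share = area / len(genes)
            totals[gene] = totals.get(gene, 0.0) + share
    return totals


# Hypothetical example: human and mouse calumenin with a 3:1 unique ratio.
unique = {"CALU_human": 300.0, "Calu_mouse": 100.0}
shared = [(400.0, ["CALU_human", "Calu_mouse"])]
distributed = distribute_shared_areas(unique, shared)
```

In this example the shared area of 400 is split 300 to 100, in contrast to a winner-take-all assignment that would give all 400 to one gene product.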
Fig. 4.
Validation of gpGrouper algorithm for estimation of tumor percentage and protein quantification in PDX samples. A, Schematic depicting the 5 cell mixtures used to test human/mouse proteome deconvolution by gpGrouper. Cells were lysed and digested with trypsin separately before mixing in the given ratios by peptide amount. B, Measured percentages of human and mouse proteins in each mixture reproducibly match the expected values (n = 3), with a slight ∼3% bias toward human assignments. C, Correlation and distribution plots of human gene products in the 1:1 mixture versus their expected levels after (1) only using unique-to-human peptides, (2) only grouping with the human RefSeq, and (3) grouping against a human/mouse concatenated RefSeq and distributing peptide peak areas across species when necessary. D, Examples of varying levels of stromal infiltration across PDX replicates of breast cancer tumors from 4 patients. BCM-5998 PDXs consistently show a human composition above 80%. BCM-3469 PDXs, while lower, are consistent at nearly 75% human. BCM-3611 PDXs are more variable, with percentages ranging from 40 to 60% human. Finally, the BCM-4913 model is extremely inconsistent, with the human composition ranging from 3 to 65%.
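One plausible way to derive the species percentages in panels B and D is to sum the distributed areas of gene products per species and normalize. The caption does not state the exact formula, so the sketch below is an assumption; the function name and inputs are hypothetical.

```python
def species_percentages(gene_areas, gene_species):
    """Percentage of total quantified area attributed to each species.

    gene_areas: dict mapping gene product -> distributed peak area.
    gene_species: dict mapping gene product -> species label.
    Assumption: composition is estimated from summed areas per species.
    """
    totals = {}
    for gene, area in gene_areas.items():
        species = gene_species[gene]
        totals[species] = totals.get(species, 0.0) + area
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {species: 0.0 for species in totals}
    return {s: 100.0 * t / grand_total for s, t in totals.items()}


# Hypothetical two-gene sample with 600 human and 200 mouse area units.
areas = {"CALU_human": 600.0, "Calu_mouse": 200.0}
species = {"CALU_human": "human", "Calu_mouse": "mouse"}
composition = species_percentages(areas, species)
```

Because shared peptide areas are distributed rather than discarded, species-shared peptides still contribute to both totals in this kind of estimate.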
Fig. 5.
Comparison of gpGrouper iBAQ-based expression estimates from label-free and isobaric proteome profiling of WHIM PDXs. A, Schematic describing the MS1 splitting procedure used by gpGrouper on isobaric profiling data. For a given PSM, the relative ratios of the reporter ions (in this case from an iTRAQ 4-PLEX) are used to split the corresponding MS1 peak area. The quantified value for each PSM and gene product is then reported separately for each channel (representing distinct samples) based on this split. B, Unsupervised clustering of the WHIM PDX breast tumor proteomic data previously published by CPTAC (“iTRAQ” dataset), by gpGrouper using the same input PSM data (“iTRAQ-gpG” dataset), and on the same tumor models analyzed via label-free profiling (“LFree-gpG” dataset). C, Pearson correlation value matrix for each tumor as analyzed by gpGrouper on the iTRAQ and label-free data.
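The MS1 splitting step in panel A amounts to apportioning one peak area across channels in proportion to reporter ion intensity. A minimal sketch, assuming a plain proportional split with no isotope impurity correction; the function name, channel labels, and intensities are hypothetical.

```python
def split_ms1_area(ms1_area, reporter_intensities):
    """Split one PSM's MS1 peak area across isobaric channels.

    ms1_area: peak area (AUC) of the precursor for this PSM.
    reporter_intensities: dict mapping channel label -> reporter ion intensity.
    Assumption: each channel receives area proportional to its reporter signal.
    """
    total = sum(reporter_intensities.values())
    if total <= 0:
        # No reporter signal: nothing can be attributed to any channel.
        return {channel: 0.0 for channel in reporter_intensities}
    return {
        channel: ms1_area * (intensity / total)
        for channel, intensity in reporter_intensities.items()
    }


# Hypothetical iTRAQ 4-plex PSM with reporter intensities in a 1:1:2:4 ratio.
reporters = {"114": 1.0, "115": 1.0, "116": 2.0, "117": 4.0}
per_channel = split_ms1_area(1000.0, reporters)
```

The per-channel areas sum back to the original MS1 area, so downstream iBAQ-style values stay on the same scale as label-free data.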
