Nat Commun. 2018 Mar 2;9(1):903.
doi: 10.1038/s41467-018-03311-y.

Discovery of Coding Regions in the Human Genome by Integrated Proteogenomics Analysis Workflow

Yafeng Zhu et al. Nat Commun.
Free PMC article

Erratum in

Abstract

Proteogenomics enables the discovery of novel peptides (from unannotated genomic protein-coding loci) and single amino acid variant (SAAV) peptides (derived from single-nucleotide polymorphisms and mutations). Increasing the reliability of these identifications is crucial to ensure their usefulness for genome annotation and their potential application as neoantigens in cancer immunotherapy. Here we present the integrated proteogenomics analysis workflow (IPAW), which combines peptide discovery, curation, and validation. IPAW includes the SpectrumAI tool for automated inspection of MS/MS spectra, which eliminates false identifications of single-residue substitution peptides. We employ IPAW to analyze two proteomics data sets, acquired from A431 cells and from five normal human tissues, using extended (pH range 3–10) high-resolution isoelectric focusing (HiRIEF) pre-fractionation and TMT-based peptide quantitation. The IPAW results provide evidence for the translation of pseudogenes, lncRNAs, short ORFs, alternative ORFs, N-terminal extensions, and intronic sequences. Moreover, our quantitative analysis indicates that protein production from certain pseudogenes and lncRNAs is tissue specific.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Full pI range (3–10) HiRIEF provides broad peptidome and proteome coverage. a The top panel shows the comparison between experimental and theoretical pI distributions of TMT-labeled peptides from the A431 cell line data set. The six major peaks in the theoretical pI distribution represent groups of peptides with characteristic amino acid compositions. For example, peptides with a higher number of Asp (D) and Glu (E) residues than the total number of Lys (K), Arg (R), and His (H) will have a pI between 3.5 and 5. The middle panel shows the accuracy of pI prediction by the PredpI algorithm across the full pI range. The bottom panel shows the experimental pI ranges of the four IPG strips employed in this study. Nominal pH ranges are indicated on the left side with actual pH ranges next to the bars. See Supplementary Figures 4, 6, and 7 for pI fraction resolution, reproducibility and yield. b Overlap of identified fully tryptic human peptides (at protein level FDR 1%) between the A431 cells data set, the normal tissues data set and the public peptide repository PeptideAtlas (release 2017-01)
Fig. 2
A proteogenomics workflow to discover, curate, and validate novel and SAAV peptides. The pipeline consists of three major stages: discovery, curation, and validation. The discovery stage is performed with MS-GF+ using two database strategies. The type 1 search is performed against a single database consisting of known peptides concatenated with variant peptides. The type 2 search is enabled by HiRIEF peptide fractionation and is performed against pI-restricted databases of tryptic peptides generated from a six-frame translation (6FT) of the human genome. The discovery stage outputs novel and SAAV peptides at 1% class-specific FDR. In the curation stage, candidate SAAV peptides are curated by SpectrumAI. The novelty of candidate novel peptides from the discovery stage is ensured by BLASTP analysis against known protein databases, including the UniProt reference proteome (with isoforms), Ensembl human protein database v83, RefSeq, and Gencode v24; the subset of novel peptides with a single amino acid substitution is also curated by SpectrumAI. In the validation stage, quality control plots such as delta pI, precursor mass error, and search engine score distribution are made. In addition, curated novel peptides are evaluated for orthogonal data support, e.g., from RNA-seq, ribosome profiling, and CAGE data, conservation analysis, and coding potential prediction
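
The class-specific FDR mentioned in the caption can be illustrated with a small target-decoy sketch. This is only an illustration under stated assumptions, not the IPAW implementation: the function name, the `(score, is_decoy, peptide_class)` tuple layout, and the simple decoys/targets estimate are all hypothetical choices made here. The point it shows is that the FDR is estimated separately within each peptide class (for example "novel" or "SAAV") rather than over the pooled results.

```python
from collections import defaultdict

def class_specific_fdr_filter(psms, threshold=0.01):
    """Filter PSMs at a per-class target-decoy FDR threshold.

    psms: iterable of (score, is_decoy, peptide_class); higher score is better.
    Returns {peptide_class: [scores of accepted target PSMs]}.
    Assumption: FDR is estimated as (#decoys / #targets) above the cutoff.
    """
    by_class = defaultdict(list)
    for score, is_decoy, cls in psms:
        by_class[cls].append((score, is_decoy))

    accepted = {}
    for cls, items in by_class.items():
        items.sort(key=lambda x: x[0], reverse=True)
        targets = decoys = 0
        best_cut = 0  # number of targets kept at the deepest passing cutoff
        for _score, is_decoy in items:
            if is_decoy:
                decoys += 1
            else:
                targets += 1
            if targets and decoys / targets <= threshold:
                best_cut = targets
        target_scores = [s for s, d in items if not d]
        accepted[cls] = target_scores[:best_cut]
    return accepted
```

Estimating the FDR within each class matters because novel and variant peptides are rare compared with known peptides, so a pooled 1% FDR could hide a much higher error rate in the rare classes.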
Fig. 3
SpectrumAI increases identification accuracy of peptides with single amino acid changes. a Precursor mass error distributions of peptides classified as curated and discarded by SpectrumAI. b Curated SAAV peptides have more overlap with missense variants identified at DNA and RNA level. c Mirror plot of an incorrectly identified peptide (which had nevertheless passed the discovery stage at class-specific FDR 1%) with a single residue substitution (V > L, at position 8) that was subsequently discarded by SpectrumAI. The annotated MS2 spectrum of the endogenous peptide is shown on top, whereas that of the respective synthetic peptide is inverted and shown on the bottom. This incorrect peptide identification detected by SpectrumAI shows mismatching b6 and b7 product ions (highlighted on the synthetic side and missing on the endogenous side), which ought to have flanked the substituted residue, indicating that the endogenous amino acid sequence is incorrect between its sixth and eighth residues
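
The idea of checking for product ions that flank a substituted residue can be sketched as follows. This is a simplified illustration in the spirit of the caption, not SpectrumAI's actual code. Assumptions made here: peptides are unmodified (no TMT or other modification masses are added), only singly charged b and y ions are considered, the function names and the 10 ppm tolerance are hypothetical, and a substitution counts as supported when each backbone cleavage site next to the residue is matched by at least one b or y ion.

```python
PROTON = 1.007276
WATER = 18.010565
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def b_ion(peptide, k):
    """m/z of the singly charged b_k ion (first k residues)."""
    return sum(MONO[a] for a in peptide[:k]) + PROTON

def y_ion(peptide, k):
    """m/z of the singly charged y_k ion (last k residues)."""
    return sum(MONO[a] for a in peptide[-k:]) + WATER + PROTON

def matched(mz, peaks, tol_ppm):
    """True if any observed peak lies within tol_ppm of the theoretical m/z."""
    return any(abs(p - mz) / mz * 1e6 <= tol_ppm for p in peaks)

def substitution_supported(peptide, pos, peaks, tol_ppm=10.0):
    """Check for ions flanking the residue at 1-based position pos."""
    n = len(peptide)
    sites = []
    for cut in (pos - 1, pos):  # cleavage after residue number `cut`
        if 1 <= cut <= n - 1:   # skip cleavage sites that do not exist
            sites.append(
                matched(b_ion(peptide, cut), peaks, tol_ppm)
                or matched(y_ion(peptide, n - cut), peaks, tol_ppm)
            )
    return bool(sites) and all(sites)
```

For a V > L substitution, the fragment that contains the substituted residue shifts by about 14.016 Da, so a spectrum that actually matches the other residue fails the check on at least one flanking site.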
Fig. 4
Unannotated protein-coding loci found in the A431 cells data set. a The left pie chart shows the number of unannotated protein-coding loci supported by one, two, or more peptides (peptides within 10 kb distance were grouped into one locus); the right pie chart shows the different types of unannotated coding events supported by multiple peptides. b Automatic categorization of novel peptides by ANNOVAR using RefSeq gene annotation (see “Methods”). c Manhattan plot of novel peptides, where the y-axis represents the peptide’s posterior error probability (PEP). d Orthogonal data support for novel peptides, including PhyloCSF coding potential, conservation analysis, A431 cell line RNA-seq read evidence, ribosome profiling, CAGE (up to 500 bp upstream from peptide location), presence of neighboring peptides (within 10 kb), and whether the peptide was identified in the draft proteome data of Kim et al. and Wilhelm et al. Continuous variables were discretized to binary values 0 or 1 for visualization purposes. 10,000 random genomic loci were used to determine the threshold for calling Ribo-seq or CAGE data supportive (see Supplementary Figure 20). e The conservation score (PhastCons score) distribution of pseudogenes and lncRNAs for which peptides were found was compared to that of 1000 randomly selected pseudogenes and lncRNAs. In the box plots, the center line corresponds to the median, box boundaries correspond to the first and third quartiles (Q1 and Q3), the upper whisker is min(max(x), Q3 + 1.5 × IQR) and the lower whisker is max(min(x), Q1 − 1.5 × IQR)
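
The grouping of peptides into loci described in panel a can be sketched as a single pass over position-sorted peptides. This is a minimal sketch under assumptions that the caption does not specify: distance is measured from the end of the current locus to the start of the next peptide, strand is ignored, and the function name and the `(chrom, start, end)` tuple layout are hypothetical.

```python
def group_into_loci(peptides, max_dist=10_000):
    """Group peptides lying within max_dist bp of each other into loci.

    peptides: list of (chrom, start, end) genomic coordinates.
    Returns a list of loci, each a list of peptide tuples.
    """
    loci = []
    for pep in sorted(peptides):  # sorts by chromosome, then start
        chrom, start, end = pep
        last = loci[-1] if loci else None
        if last and last["chrom"] == chrom and start - last["end"] <= max_dist:
            last["peptides"].append(pep)
            last["end"] = max(last["end"], end)  # extend the locus
        else:
            loci.append({"chrom": chrom, "end": end, "peptides": [pep]})
    return [locus["peptides"] for locus in loci]
```

Counting the loci by `len(locus)` then gives the split into loci supported by one, two, or more peptides.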
Fig. 5
Examples of unannotated protein-coding regions discovered. Gray lines indicate introns, black thick lines are UTRs, colored boxes are coding regions (color indicates reading frame). Novel peptides are shown as red boxes unless they are in different reading frames. a Pseudogene TATDN2P1 protein identified with two novel peptides linked in the same open reading frame. b LncRNA ENSG00000267943 protein identified with four novel peptides. c An alternative reading frame protein of the DRAP1 gene was identified with four novel peptides. The color of exons and novel peptides indicates reading frame. Exons and peptides in the same color (darker shade for peptides) are in the same reading frame. d An alternative protein N terminus for gene C1orf122 was identified with two novel peptides. e Two novel peptides serving as evidence for the existence of “retained intron” translation for the EGFR gene. f An extended exon protein variant of gene MPRIP was identified with three novel peptides
Fig. 6
Quantitative analysis of novel peptides identified in the normal tissues data set. a Novel peptide tissue expression. Pearson correlation and the complete linkage method were used for clustering. Row Z-scores are shown in the heat map. b TMT-based tissue quantification of the pseudogene TATDN2P1 peptides points to testis specificity. The three dots in the TMT ratio plots indicate quantification of three individual PSMs, with the center bar as the mean and error bars as standard deviation. c TMT-based tissue quantification of TATDN2 peptides indicates broad tissue expression (quantification values from three PSMs). d RNA-seq read counts of TATDN2P1 in different tissues confirm testis specificity
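
The row Z-scores and the Pearson-based distance used for the heat map in panel a can be sketched with the standard library alone. This is an illustrative sketch, not the authors' plotting code: the function names are hypothetical, the sample standard deviation is assumed for the Z-score, and distance is taken as 1 minus the Pearson correlation. Rows with no variation would divide by zero and need to be filtered out first.

```python
from math import sqrt
from statistics import mean, stdev

def row_zscores(row):
    """Scale one peptide's values across tissues to mean 0 and SD 1."""
    m, s = mean(row), stdev(row)  # sample standard deviation (assumption)
    return [(x - m) / s for x in row]

def pearson_distance(a, b):
    """1 - Pearson correlation between two peptides' tissue profiles."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return 1 - cov / sqrt(va * vb)
```

A distance matrix built this way can be passed to any hierarchical clustering routine that supports complete linkage; the same `mean` and `stdev` calls also give the center bar and error bars for three PSM-level TMT ratios.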


