BMC Genomics. 2013 Feb 28;14:141.
doi: 10.1186/1471-2164-14-141.

Whole Human Genome Proteogenomic Mapping for ENCODE Cell Line Data: Identifying Protein-Coding Regions

Free PMC article

Jainab Khatun et al. BMC Genomics.


Background: Proteogenomic mapping is an approach that uses mass spectrometry data from proteins to directly map protein-coding genes and could aid in locating translational regions in the human genome. In concert with the ENcyclopedia of DNA Elements (ENCODE) project, we applied proteogenomic mapping to produce proteogenomic tracks for the UCSC Genome Browser, to explore which putative translational regions may be missing from the human genome.

Results: We generated ~1 million high-resolution tandem mass (MS/MS) spectra for Tier 1 ENCODE cell lines K562 and GM12878 and mapped them against the UCSC hg19 human genome, and the GENCODE V7 annotated protein and transcript sets. We then compared the results from the three searches to identify the best-matching peptide for each MS/MS spectrum, thereby increasing the confidence of the putative new protein-coding regions found via the whole genome search. At a 1% false discovery rate, we identified 26,472, 24,406, and 13,128 peptides from the protein, transcript, and whole genome searches, respectively; of these, 481 were found solely via the whole genome search. The proteogenomic mapping data are available on the UCSC Genome Browser at

Conclusions: The whole genome search revealed that ~4% of the uniquely mapping identified peptides were located outside GENCODE V7 annotated exons. The comparison of the results from the disparate searches also identified 15% more spectra than would have been found solely from a protein database search. Therefore, whole genome proteogenomic mapping is a complementary method for genome annotation when performed in conjunction with other searches.
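The comparison step described in the Results (choosing the best-matching peptide for each MS/MS spectrum across the protein, transcript, and whole genome searches) can be illustrated with a minimal sketch. The abstract does not specify the scoring or tie-breaking rules, so this example assumes each match carries an E-value where lower is better, and that ties keep the match seen first. The `Match` class, the spectrum identifiers, and the peptides are hypothetical toy data, not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    spectrum_id: str
    peptide: str
    search: str     # "protein", "transcript", or "genome"
    e_value: float  # assumed score: lower means a better match


def best_match_per_spectrum(matches):
    """Keep one match per spectrum: the one with the lowest E-value.

    Ties keep whichever match was seen first (an assumption of this sketch).
    """
    best = {}
    for m in matches:
        current = best.get(m.spectrum_id)
        if current is None or m.e_value < current.e_value:
            best[m.spectrum_id] = m
    return best


# Hypothetical toy matches from the three searches.
matches = [
    Match("s1", "PEPTIDEK", "protein", 1e-5),
    Match("s1", "PEPTIDEK", "genome", 1e-5),
    Match("s2", "LLSNAR", "protein", 1e-2),
    Match("s2", "VLSNAGR", "genome", 1e-4),
    Match("s3", "AAGTWK", "transcript", 1e-3),
]

best = best_match_per_spectrum(matches)

# Spectra whose best match did not come from the protein search.
beyond_protein = sorted(
    sid for sid, m in best.items() if m.search != "protein"
)
```

Listing the protein search first means a genome hit displaces an annotated protein hit only when it scores strictly better, which is one conservative way to treat putative novel regions.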


Figure 1
Overview of bottom-up proteomics and proteogenomic mapping. After cell lysis, proteins are extracted from a biological sample and are proteolytically digested into peptides. The peptide mixture is commonly separated by liquid chromatography and introduced into a tandem mass spectrometer, which produces MS/MS spectra. The resulting spectra are matched against an in silico translation and proteolytic digestion of genomic DNA sequences in all six reading frames to identify peptides. The matched peptides are then mapped back to the DNA sequences to identify the genomic loci for the analyzed proteins.
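The in silico step in this caption (six-frame translation followed by proteolytic digestion) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: it uses the standard genetic code, assumes trypsin-like cleavage (after K or R, but not before P), allows no missed cleavages, and translates unknown codons to `X`. All function names are hypothetical.

```python
import re
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; codons with unknown bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Return translations for frames +1..+3 and -1..-3."""
    seq = seq.upper()
    frames = {}
    for sign, strand in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = translate(strand[offset:])
    return frames


def digest(protein, min_len=6):
    """Trypsin-like digest: split at stops, then after K/R unless P follows."""
    peptides = []
    for segment in protein.split("*"):
        for pep in re.split(r"(?<=[KR])(?!P)", segment):
            if len(pep) >= min_len:
                peptides.append(pep)
    return peptides


# Toy sequence: ATG GCC AAG CTG TCA CGC TAA
frames = six_frames("ATGGCCAAGCTGTCACGCTAA")
peptides = digest(frames["+1"], min_len=3)
```

In a real search, each peptide would also keep its frame and offset so that a spectral match can be mapped back to genomic coordinates, as the caption describes.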
Figure 2
The distribution of the number of peptide hits per protein/transcript. The x-axis represents the number of proteins/transcripts and the y-axis represents the number of peptides that matched to that number of proteins/transcripts. Only proteins/transcripts matched to 2 or more peptides are considered in the distribution. The points in blue represent the peptide hits from the GENCODE V7 annotated proteins, while the red points represent those from the GENCODE V7 annotated transcripts.
Figure 3
Venn diagram of distinct peptide identifications from the protein, transcript, and whole genome searches. The deep red segment in the center represents the 12,177 peptides identified from all three searches. The segment in red represents the 3,628 peptides identified solely from the GENCODE V7 protein search; the blue segment represents the 1,122 peptides identified solely from the GENCODE V7 transcript search; and the brown segment represents the 481 peptides identified solely from the whole genome search.
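The caption gives four of the seven Venn segments. If the per-search totals in the Results (26,472, 24,406, and 13,128) count the same distinct peptides as the diagram, the three pairwise-only segments follow by inclusion-exclusion. That equivalence is an assumption of this sketch, and the derived values are not reported in the abstract. The function name is hypothetical.

```python
def pairwise_only_overlaps(totals, only, all_three):
    """Solve for the three 'exactly two searches' Venn segments.

    totals:    distinct peptides per search
    only:      peptides found solely by each search
    all_three: peptides found by all three searches
    """
    # What remains in each set after removing its 'only' and 'all three'
    # segments is the sum of its two pairwise-only segments.
    rest_p = totals["protein"] - only["protein"] - all_three        # PT + PG
    rest_t = totals["transcript"] - only["transcript"] - all_three  # PT + TG
    rest_g = totals["genome"] - only["genome"] - all_three          # PG + TG

    pair_sum, remainder = divmod(rest_p + rest_t + rest_g, 2)
    if remainder:
        raise ValueError("counts are inconsistent (odd pairwise total)")

    result = {
        "protein&transcript": pair_sum - rest_g,
        "protein&genome": pair_sum - rest_t,
        "transcript&genome": pair_sum - rest_p,
    }
    if min(result.values()) < 0:
        raise ValueError("counts are inconsistent (negative segment)")
    return result


# Counts quoted in the abstract and in this caption.
totals = {"protein": 26472, "transcript": 24406, "genome": 13128}
only = {"protein": 3628, "transcript": 1122, "genome": 481}

segments = pairwise_only_overlaps(totals, only, all_three=12177)
```

The consistency checks are the useful part: if the totals and the diagram counted different things, the solution would usually come out odd or negative.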
Figure 4
An example of unique GENCODE V7 intergenic proteogenomic matches. Panel A shows that these unique proteogenomic matches overlap with a protein-coding exon predicted by NScan. Blue boxes represent proteogenomic matches, green boxes represent predicted protein-coding exons, and black lines represent introns. Panel B summarizes the total MS/MS spectral support for each of the two matches in this region, where each vertical dark blue bar represents a distinct spectral match for the same peptide, with the height of the bar showing the E-value for the identification (E-values ranging from 1.0×10⁻¹ to 1.0×10⁻⁴). More and/or taller bars indicate stronger support. Panel C shows ENCODE/Caltech RNA-Seq evidence and other transcriptional data for the same region. Both matches are identified from multiple spectra, indicating relatively strong support.
Figure 5
A UCSC Genome Browser screenshot showing proteogenomic coverage across chromosome 1, with several annotation sets. The red line at the top represents our proteogenomic matches. The annotation sets shown here include GENCODE V7, Ensembl, RefSeq, and the UCSC annotation. The black line at the bottom shows the human mRNAs from GenBank.


