Expert Rev Proteomics. 2015;12(6):579-93.
doi: 10.1586/14789450.2015.1103186. Epub 2015 Oct 23.

The Potential Clinical Impact of the Release of Two Drafts of the Human Proteome

Free PMC article

Iakes Ezkurdia et al. Expert Rev Proteomics. 2015.


The authors have carried out an investigation of the two "draft maps of the human proteome" published in 2014 in Nature. The findings include an abundance of poor spectra, low-scoring peptide-spectrum matches and incorrectly identified proteins in both studies, highlighting clear issues with the application of false discovery rates. This noise means that the claims made by the two papers (the identification of high numbers of protein coding genes, the detection of novel coding regions and the draft tissue maps themselves) should be treated with considerable caution. The authors recommend that clinicians and researchers do not use the unfiltered data from these studies. Despite this, these studies will inspire further investigation into tissue-based proteomics. As long as this future work has proper quality controls, it could help produce a consensus map of the human proteome and improve our understanding of the processes that underlie health and disease.

Keywords: Clinical applications; false discovery rates; human proteome; protein coding genes; proteomics.


Figure 1.
Proportion of human proteins detected by UniProt evidence category. The percentage of proteins identified within each of the five UniProt evidence codes by the Wilhelm analysis,[2] the Kim analysis [1] and by the Ezkurdia et al. analysis.[14] We calculated the evidence codes from the Kim analysis by mapping all 292,000 peptides detected by Kim et al. to the GENCODE annotation [15] in the same manner as the Kim analysis. The Kim analysis would have identified 18,230 genes if they had searched against the GENCODE annotation in the same way as they searched against the RefSeq database.[9]
Figure 2.
Illustrating the difference between a good and a poor peptide-spectrum match. (A) A good peptide-spectrum match for the peptide VILHLKEDQTEYLEER, a peptide shared by HSP90AB1 and by several other genes. Note that almost all the b-series ions and the y-series ions in the image and in the legend on the right have been correctly identified (correct identification is indicated by the color and by the label in the image). (B) A poor peptide-spectrum match for the peptide MSGTNQAAVSEFLLLGLSR, a peptide that maps to the olfactory receptor OR1F1. Just three of the b-series ions and two of the y-series ions have been correctly mapped (again, shown by the label in the image and the color in the image and the legend on the right) and none of the correct mappings were consecutive. Both spectra came from ProteomicsDB.
Figure 3.
Illustrating how combining experiments increases the false discovery rate. The illustration shows the effect of combining two imaginary experiments, experiments 1 and 2. In the figure, the yellow boxes represent true positive peptide hits and the pink boxes represent false positive peptide identifications. The real peptide false positive rate for both experiments 1 and 2 is 10% (one false positive event in 10). However, when the two imaginary experiments are combined, the number of true positive hits only rises to 11 because 7 of the peptides were identified in both experiments. The false positive identifications were not the same in both experiments, so the real peptide false positive rate rises to 15.38% (2 in 13). In general, many of the true positive peptide hits are repeated across experiments and few of the false positive identifications are repeated, so the false discovery rate will always go up when experiments are combined, and the more experiments that are combined, the greater the effect, as it gets harder and harder to identify peptides that have not previously been identified in another experiment.
Figure 4.
Examples of the many poor spectra from the Kim analysis. (A) One of the two very poor spectra used to identify peptide TISFGGCVVQIFFIHAVGGTEMVLLIAMAFDRYVAICKPLHYLTIMNPQR for gene OR4F6. The Mascot scores of the two matches are very low, 3.22 and 2.57, and only a handful of ions are properly identified. (B) A very poor spectrum for the peptide DVAVVFTEEELELLDSTQRQLYQDVMQENFR, which is the only peptide that identifies gene ZNF229. Only the y-series is shown for this +4 charge spectrum, and very few y-series ions are identified. (C) A very poor spectrum for peptide MGYFLKLYAYVNSHSLFVWVCDR, which is used to identify EBLN2. Here just a single ion is identified. It is worth noting that this peptide is supposed to have both an N-terminal acetylation. All these spectra are from the Human Proteome Map from the Kim analysis.
Figure 5.
Examples of the many poor spectra from the Wilhelm analysis. (A) One of the three poor spectra used to identify peptide VGLSSPR for gene LINC00346. This peptide was identified with an Andromeda score of 71.95. No consecutive ions in the series were identified. (B) The very poor spectrum for peptide MRPQPRGGSGR, which maps to gene LINC00346. The peptide is supposed to be N-terminal acetylated. None of the fragments are identified. (C) One of three poor spectra for peptide SYKRSFRMILNK, which is used to identify EBLN2. Again, very few fragments are identified. All these spectra are from ProteomicsDB and from the Wilhelm analysis.
Figure 6.
Poor spectrum with a delta score of >10. One of two poor spectra that identify peptide GQGVPISCK for gene LINC00346. This peptide was identified with a Mascot delta score of 17.95, but the Mascot score was just 24.02, a score that is worse than the 5% local peptide cutoff used in the main study. Again, few ions in the two series are identified.
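
The arithmetic described in the Figure 3 caption can be reproduced with a short Python sketch. This is only an illustration, not the authors' code: the peptide labels (T1 to T11 for true hits, F1 and F2 for false hits) and the function name `combined_false_positive_rate` are hypothetical, chosen to match the counts given in the caption (9 true and 1 false hit per experiment, 7 true hits shared).

```python
def combined_false_positive_rate(experiments):
    """Fraction of false hits among all distinct hits after pooling.

    `experiments` is a list of (true_hits, false_hits) pairs, each a set of
    peptide labels. Pooling takes the union, so repeated hits count once.
    """
    true_union = set().union(*(true_hits for true_hits, _ in experiments))
    false_union = set().union(*(false_hits for _, false_hits in experiments))
    return len(false_union) / (len(true_union) + len(false_union))


# Hypothetical labels matching the caption: each experiment has 9 true hits
# and 1 false hit (10% false positives); 7 true hits are shared.
exp1 = ({f"T{i}" for i in range(1, 10)}, {"F1"})                  # T1..T9
exp2 = ({f"T{i}" for i in range(1, 8)} | {"T10", "T11"}, {"F2"})  # T1..T7, T10, T11

print(f"experiment 1 alone: {combined_false_positive_rate([exp1]):.2%}")
print(f"experiment 2 alone: {combined_false_positive_rate([exp2]):.2%}")
print(f"combined:           {combined_false_positive_rate([exp1, exp2]):.2%}")
```

Because the true hits overlap and the false hits do not, the pooled set contains 11 true and 2 false peptides, which gives the 2 in 13 rate quoted in the caption.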

Comment on

  • A draft map of the human proteome.
    Kim MS, Pinto SM, Getnet D, Nirujogi RS, Manda SS, Chaerkady R, Madugundu AK, Kelkar DS, Isserlin R, Jain S, Thomas JK, Muthusamy B, Leal-Rojas P, Kumar P, Sahasrabuddhe NA, Balakrishnan L, Advani J, George B, Renuse S, Selvan LD, Patil AH, Nanjappa V, Radhakrishnan A, Prasad S, Subbannayya T, Raju R, Kumar M, Sreenivasamurthy SK, Marimuthu A, Sathe GJ, Chavan S, Datta KK, Subbannayya Y, Sahu A, Yelamanchi SD, Jayaram S, Rajagopalan P, Sharma J, Murthy KR, Syed N, Goel R, Khan AA, Ahmad S, Dey G, Mudgal K, Chatterjee A, Huang TC, Zhong J, Wu X, Shaw PG, Freed D, Zahari MS, Mukherjee KK, Shankar S, Mahadevan A, Lam H, Mitchell CJ, Shankar SK, Satishchandra P, Schroeder JT, Sirdeshmukh R, Maitra A, Leach SD, Drake CG, Halushka MK, Prasad TS, Hruban RH, Kerr CL, Bader GD, Iacobuzio-Donahue CA, Gowda H, Pandey A. Nature. 2014 May 29;509(7502):575-81. doi: 10.1038/nature13302. PMID: 24870542. Free PMC article.


Papers of special note have been highlighted as:
• of interest
•• of considerable interest
    1. Kim MS, Pinto SM, Getnet D. A draft map of the human proteome. Nature. 2014;509:575–581.

•• One of the two papers studied in depth for this article. A proteomics analysis carried out wholly on tissues and hematopoietic cells.

    1. Wilhelm M, Schlegl J, Hahne H. Mass-spectrometry-based draft of the human proteome. Nature. 2014;509:582–587.

•• The other paper that is the subject of this article. The tissue and fluid proteomics experiments were only a small part of this study.

    1. Venter JC, Adams MD, Myers EW. The sequence of the human genome. Science. 2001;291.
    1. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature. 2001;409.
    1. Koenig T, Menze BH, Kirchner M. Robust prediction of the MASCOT score for an improved quality assessment in mass spectrometric proteomics. J Proteome Res. 2008;7:3708–3717.
    1. Cox J, Neuhauser N, Michalski A. Andromeda: a peptide search engine integrated into the MaxQuant environment. J Proteome Res. 2011;10:1794–1805.
    1. UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 2015;43:D204–212.

•• A paper that is a counterpoint to the two Nature articles. The authors found that proteomics analyses detect peptides from the most ancient genes and very few from recently evolved genes. Proteins the two Nature studies claimed to have detected will have been removed from the reference genome as a result of this article.

    1. Harrow J, Frankish A, Gonzalez JM. GENCODE: the reference human genome annotation for the ENCODE project. Genome Res. 2012;22:760–774.
    1. NCBI Resource Coordinators. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2015;43:D16–17.
    1. Verbeurgt C, Wilkin F, Tarabichi M. Profiling of olfactory receptor gene expression in whole human olfactory mucosa. PLoS One. 2014;9:e96333.
    1. Deutsch EW, Sun Z, Campbell D. The state of the human proteome in 2014/2015 as viewed through PeptideAtlas: enhancing accuracy and coverage through the AtlasProphet. J Proteome Res. 2015.

•• Another contrast to the two Nature papers. The PeptideAtlas update very elegantly finds that the two studies add no more than 500 proteins to those already identified in experiments on cell lines.

    1. Ezkurdia I, Vázquez J, Valencia A. Analyzing the first drafts of the human proteome. J Proteome Res. 2014;13:3854–3855.
    1. Ezkurdia I, Vázquez J, Valencia A. Correction to “Analyzing the first drafts of the human proteome”. J Proteome Res. 2015;14:1991.
    1. Elias JE, Gygi SP. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat Methods. 2007;4:207–214.

• One of the first papers to propose the calculation of false positive rates using decoy peptides.

    1. Nesvizhskii AI. A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. J Proteom. 2010;73:2092–2123.

• A detailed review of the use of false discovery rates in proteomics experiments, showing how errors are amplified when going from peptide to protein level.

    1. Serang O, Käll L. Solution to statistical challenges in proteomics is more statistics, not less. J Proteome Res. 2015.
    1. Reiter L, Claassen M, Schrimpf SP. Protein identification false discovery rates for very large proteomics data sets generated by tandem mass spectrometry. Mol Cell Proteomics. 2009;8:2405–2417.
    1. Savitski MM, Wilhelm M, Hahne H. A scalable approach for protein false discovery rate estimation in large proteomic data sets. Mol Cell Proteomics. 2015. doi: 10.1074/mcp.M114.046995.
    1. Gaudet P, Michel PA, Zahn-Zabal M. The neXtProt knowledgebase on human proteins: current status. Nucleic Acids Res. 2015;43:D764–770.
    1. Colaert N, Van Huele C, Degroeve S. Combining quantitative proteomics data processing workflows for greater sensitivity. Nat Methods. 2011;8:481–483.

• This paper sets out the potential harmful effects of combining large-scale high-throughput proteomics and insufficiently validated data.

    1. Cooper B. The problem with peptide presumption and the downfall of target-decoy false discovery rates. Anal Chem. 2012;84:9663–9667.

• Explains how recent advances in high-throughput proteomics can easily lead to identifying peptides that do not exist.

    1. Bonzon-Kulichenko E, Garcia-Marques F, Trevisan-Herraz M. Revisiting peptide identification by high-accuracy mass spectrometry: problems associated with the use of narrow mass precursor windows. J Proteome Res. 2015;14:700–710.

• The authors set out solutions for the problems identified in Ref. [33].

    1. Omenn GS, Lane L, Lundberg EK. Metrics for the human proteome project 2015: progress on the human proteome and guidelines for high-confidence protein identification. J Proteome Res. 2015.

•• Details many of the shortcomings of the two Nature analyses and addresses the state of the art in protein detection.

    1. Horvatovich P, Lundberg EK, Chen YJ. Quest for missing proteins: update 2015 on chromosome-centric human proteome project. J Proteome Res. 2015.
    1. Nesvizhskii AI. Proteogenomics: concepts, applications and computational strategies. Nat Methods. 2014;11:1114–1125.

• The paper discusses the concepts and potential pitfalls of proteogenomics studies in considerable detail.

    1. Krug K, Carpy A, Behrends G. Deep coverage of the Escherichia coli proteome enables the assessment of false discovery rates in simple proteogenomic experiments. Mol Cell Proteomics. 2013;12:3420–3430.
    1. Ross PL, Huang YN, Marchese JN. Multiplexed protein quantitation in Saccharomyces cerevisiae using amine-reactive isobaric tagging reagents. Mol Cell Proteomics. 2004;3:1154–1169.
    1. Ezkurdia I, Rodriguez JM, Carrillo-de Santa Pau E. Most highly expressed protein-coding genes have a single dominant isoform. J Proteome Res. 2015;14:1880–1887.
    1. Abascal F, Ezkurdia I, Rodriguez-Rivas J. Alternatively spliced homologous exons have ancient origins and are highly expressed at the protein level. PLoS Comput Biol. 2015;11:e1004325.
    1. Huang Da W, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009;4:44–57.

• This large-scale study concentrates on cancer cells instead of tissues. Combining large-scale proteomics analysis of healthy and diseased cells has promise for the detection of biomarkers.

    1. Narayanan R. Phenome-genome association studies of pancreatic cancer: new targets for therapy and diagnosis. Cancer Genom Proteom. 2015;12:9–19.
    1. Narayanan R. Ebola-associated genes in the human genome: implications for novel targets. MOJ Proteom Bioinform. 2015;1:00032.
    1. Shao S, Guo T, Aebersold R. Mass spectrometry-based proteomic quest for diabetes biomarkers. Biochim Biophys Acta. 2015;1854:519–527.

• In this work, the authors review the current status of diabetes mellitus biomarker discovery through different mass spectrometry techniques.

    1. Hathout Y. Proteomic methods for biomarker discovery and validation. Are we there yet? Expert Rev Proteom. 2015;12:329–331.

• A review detailing recent advances in the discovery of protein biomarkers via proteomics and the difficulties of validating these biomarkers.

    1. Aebersold R, Bader GD, Edwards AM. The biology/disease-driven human proteome project (B/D-HPP): enabling protein research for the life sciences community. J Proteome Res. 2013;12:23–27.
    1. Zhang K, Fu Y, Zeng WF. A note on the false discovery rate of novel peptides in proteogenomics. Bioinformatics. 2015;31:3249–3253.
    1. Ma J, Ward CC, Jungreis I. Discovery of human sORF-encoded polypeptides (SEPs) in cell lines and tissue. J Proteome Res. 2014;13:1757–1765.
    1. Vanderperre B, Lucier JF, Bissonnette C. Direct detection of alternative open reading frames translation products in human significantly expands the proteome. PLoS One. 2013;8:e70698.
    1. Brusniak MY, Chu CS, Kusebauch U. An assessment of current bioinformatic solutions for analyzing LC-MS data acquired by selected reaction monitoring technology. Proteomics. 2012;12:1176–1184.

The authors find that lincRNAs behave differently from protein coding transcripts when passing through the ribosome.

    1. Ruiz-Orera J, Messeguer X, Subirana JA. Long non-coding RNAs as a source of new peptides. Elife. 2014;3:e03523.
    1. Griss J, Perez-Riverol Y, Hermjakob H. Identifying novel biomarkers through data mining-a realistic scenario? Proteom Clin Appl. 2015;9:437–443.
    1. Khatun J, Yu Y, Wrobel JA. Whole human genome proteogenomic mapping for ENCODE cell line data: identifying protein-coding regions. BMC Genom. 2013;14:141.
