One goal of the Human Proteome Project is to identify at least one protein product for each of the ∼20,000 human protein-coding genes. As of October 2014, however, there are 3564 genes (18%) that have no or insufficient evidence of protein existence (PE), as curated by neXtProt; these comprise 2647 PE2-4 missing proteins and 616 PE5 dubious protein entries. We conducted a systematic examination of the 616 PE5 protein entries using cutting-edge protein structure and function modeling methods. Compared to a random sample of high-confidence PE1 proteins, the putative PE5 proteins were found to be over-represented in the membrane and cell surface proteins and peptides fold families. Detailed functional analyses show that most PE5 proteins, if expressed, would belong to transporters and receptors localized in the plasma membrane compartment. The results suggest that experimental difficulty in identifying membrane-bound proteins and peptides could have precluded their detection in mass spectrometry and that special enrichment techniques with improved sensitivity for membrane proteins could be important for the characterization of the PE5 "dark matter" of the human proteome. Finally, we identify 66 high scoring PE5 protein entries and find that six of them were reported in recent mass spectrometry databases; an illustrative annotation of these six is provided. This work illustrates a new approach to examine the potential folding and function of the dubious proteins comprising PE5, which we will next apply to the far larger group of missing proteins comprising PE2-4.
Keywords: COFACTOR; Human Proteome Project; I-TASSER; PeptideAtlas; missing proteins; neXtprot; protein folding; structure-based function annotation.