Genome Res. 29 (12), 2073-2087

Discovery of High-Confidence Human Protein-Coding Genes and Exons by Whole-Genome PhyloCSF Helps Elucidate 118 GWAS Loci

Jonathan M Mudge et al. Genome Res.

Abstract

The most widely appreciated role of DNA is to encode protein, yet the exact portion of the human genome that is translated remains to be ascertained. We previously developed PhyloCSF, a widely used tool to identify evolutionary signatures of protein-coding regions using multispecies genome alignments. Here, we present the first whole-genome PhyloCSF prediction tracks for human, mouse, chicken, fly, worm, and mosquito. We develop a workflow that uses machine learning to predict novel conserved protein-coding regions and efficiently guide their manual curation. We analyze more than 1000 high-scoring human PhyloCSF regions and confidently add 144 conserved protein-coding genes to the GENCODE gene set, as well as additional coding regions within 236 previously annotated protein-coding genes, and 169 pseudogenes, most of them disabled after primates diverged. The majority of these represent new discoveries, including 70 previously undetected protein-coding genes. The novel coding genes are additionally supported by single-nucleotide variant evidence indicative of continued purifying selection in the human lineage, coding-exon splicing evidence from new GENCODE transcripts using next-generation transcriptomic data sets, and mass spectrometry evidence of translation for several new genes. Our discoveries required simultaneous comparative annotation of other vertebrate genomes, which we show is essential to remove spurious ORFs and to distinguish coding from pseudogene regions. Our new coding regions help elucidate disease-associated regions by revealing that 118 GWAS variants previously thought to be noncoding are in fact protein altering. Altogether, our PhyloCSF data sets and algorithms will help researchers seeking to interpret these genomes, while our new annotations present exciting loci for further experimental characterization.

Figures

Figure 1.
Computing PhyloCSF Candidate Coding Regions (PCCRs). (A) Flow chart of overall process. Numbers in orange are counts for the human hg38 assembly relative to the GENCODE v23 gene set. The hypothetical browser image at the bottom illustrates how the PhyloCSF Regions list is pruned to define PCCRs. In the vicinity of a coding gene (blue) and a pseudogene (pink), we initially have a set of intervals in each of the six possible reading frames (“PhyloCSF Regions”) that are more likely to be in the coding state than noncoding state of the HMM (gray-scale intervals in the six “PhyloCSF*Regns” tracks). We then exclude any that overlap known coding genes in the same frame (1) or anti-sense frame (2) or that overlap known pseudogenes in any frame on either strand (3). Next, we exclude regions less than nine codons long (4) and regions predicted by our antisense SVM to be likely antisense regions (5). Finally, we add back nonoverlapping fragments of PhyloCSF Regions that partly overlap annotations because these could be extensions of known exons (6). The resulting PCCRs are shown in green. These sometimes overlap known coding regions, and this is an indication that the PhyloCSF signal is in a different frame from the annotated one (7). The resulting PCCRs are then ranked by an SVM and investigated by expert manual annotators to find novel coding regions and pseudogenes. (B) Performance on previously annotated coding genes. Column chart on the left shows the fraction (93%) of GENCODE v23 coding genes that overlap at least one PhyloCSF Region; the remaining 7% could not have been identified by our workflow. Density plot on the right measures the efficiency of our PCCR-ranking SVM by showing SVM scores for all PCCRs (black) and scores of the highest-scoring PhyloCSF Region that overlaps each GENCODE v23 coding gene that overlaps at least one PhyloCSF Region (red). 
For 92% of such coding genes, the score is in the 99th percentile of scores of PCCRs (shaded area), indicating that manual examination of the top-ranked 1% of PCCRs would have uncovered each of these coding genes if it had not been known previously, and suggesting that most true novel coding genes could be identified by examining the best ranking PCCRs. (C) PhyloCSF tracks in UCSC Genome Browser showing the “−” strand of C. elegans Chromosome X. Upper six green and red “PhyloCSFraw” tracks show the raw PhyloCSF score for each codon in each of six reading frames. The black “PhyloCSF power” track indicates the relative branch length of the local alignment, a measure of the statistical power available to PhyloCSF; there is near full alignment for the first approximately three-fourths of the track, but then there are fewer aligned species for the remaining one-fourth. Codons having relative branch length less than 0.1 show no scores. The next six green and red “PhyloCSF” tracks show the PhyloCSF scores smoothed by the HMM. The six “PhyloCSF*Regns” tracks show PhyloCSF Regions, with gray scale indicating the maximum probability of coding. The “PhyloCSF novel” track shows the PCCRs in all six frames combined into a single track with green and red intervals indicating the plus and minus strands, respectively, and with the rank of the region within the list of PCCRs shown next to the region, with lower ranks indicating stronger likelihood of coding. The two “Splice Pred” tracks show splice donor (green) and acceptor (red) predictions at GT and AG dinucleotides, respectively, on the plus and minus strands, with the height of each bar indicating the strength of the splice prediction. 
In the example shown, the tracks allow us to conjecture that there is a novel coding exon on the minus strand roughly coinciding with the 3083rd PCCR (1), extending from the ATG indicated by the small green rectangle in the third base position track at the top (2) up to the green splice donor prediction in the “SplicePred−” track (3).
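The pruning described in panel (A) can be illustrated with a minimal sketch, offered as an illustration rather than the authors' implementation. It covers only interval subtraction against known annotations (steps 1 to 3 and 6) and the nine-codon length filter (step 4). The names `subtract`, `prune`, and `MIN_CODONS` are hypothetical. The sketch ignores reading frame and strand, omits the antisense SVM (step 5), and assumes that the length filter also applies to the added-back fragments, which the caption does not state.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), in nucleotides

MIN_CODONS = 9  # step (4): regions shorter than nine codons are dropped


def subtract(region: Interval, blockers: List[Interval]) -> List[Interval]:
    """Return the parts of `region` not covered by any interval in `blockers`."""
    start, end = region
    pieces = []
    for b_start, b_end in sorted(blockers):
        if b_end <= start or b_start >= end:
            continue  # this annotation does not touch what remains
        if b_start > start:
            pieces.append((start, b_start))  # uncovered stretch before the annotation
        start = max(start, b_end)
        if start >= end:
            break
    if start < end:
        pieces.append((start, end))  # uncovered tail
    return pieces


def prune(regions: List[Interval], annotations: List[Interval]) -> List[Interval]:
    """Keep unannotated fragments of each region that are at least MIN_CODONS long.

    Assumption: frame and strand are ignored, so any overlap with an
    annotation removes the overlapping part of a region.
    """
    candidates = []
    for region in regions:
        for frag_start, frag_end in subtract(region, annotations):
            if frag_end - frag_start >= 3 * MIN_CODONS:
                candidates.append((frag_start, frag_end))
    return candidates


if __name__ == "__main__":
    regions = [(100, 200), (300, 320), (400, 500), (600, 620)]
    annotations = [(150, 450)]
    print(prune(regions, annotations))
```

In this toy input, the region fully inside the annotation disappears, the two partly overlapping regions contribute their nonoverlapping fragments, and the 20-nt region is dropped as shorter than nine codons.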
Figure 2.
Novel protein-coding loci. Browser images show CDSs (open green rectangles), UTRs (pink), supporting PCCRs (red), top rank (black), cDNA evidence (brown), and RNA-seq–supported introns (blue rectangles). Additional transcript models omitted for clarity. Multispecies protein alignments showing conservation of complete ORFs are in Supplemental Figure S4. (A) Novel coding gene SMIM31, previously a cDNA-supported GENCODE lincRNA, was changed to protein coding without a change of transcript structure owing to a 71-aa CDS (ENST00000507311) conserved to coelacanth. The protein-coding cDNA-supported ortholog was added to mouse GENCODE (Smim31). PhyloCSF does not detect coding potential in the second coding exon, but multispecies protein alignment and preponderance of 3-mer indels provide evidence this exon is coding. Human Protein Atlas (HPA) RNA-seq and human and mouse FANTOM5 CAGE data show high transcription in gastrointestinal tissues. (B) Novel coding gene C10orf143 was previously a GENCODE lncRNA (LINC00959), with two cDNA-derived models (ENST00000647406 and ENST00000456581). Discovery of the 108-aa CDS required adding a transcript model (ENST00000637128), supported by Intropolis short-read data. The original lncRNA transcripts have been reannotated as nonsense-mediated decay targets (purple ORFs), based on a premature stop codon in a cassette exon. The orthologous cDNA-supported mouse locus had previously been recognized as protein coding (9430038I01Rik). The gene has a broad expression profile in both species. (C) CCDC201 is a novel human gene with a 187-aa CDS conserved to birds, previously missed owing to lack of spliced cDNA or EST evidence. The ancestral stop codon has been lost in rodents, adding a 30-aa extension in novel mouse protein-coding gene ENSMUSG00000087512. Introns are supported by Intropolis short-read RNA-seq, limited to female reproductive tissues and certain developmental cells. 
Mouse ENCODE RNA-seq supports placenta and ovary expression only, and the mouse locus (in the guise of a ncRNA) had previously been identified as a target for the germ cell–specific transcription factor Figla (Joshi et al. 2007). (D) H2BE1 is a novel histone H2B family member protein-coding gene with a 122-aa CDS (model ENST00000644661), whose first exon was identified in this study. Intropolis supports the transcript structure, with expression limited to oocytes and embryonic cells (e.g., SRR499827). Human FANTOM5 CAGE data lack experiments from developmental stages, which may explain the absence of TSS evidence. Overlapping model ENST00000222388 had previously been annotated as an alternative transcript of ABCF2 (ancestral CDS represented by model ENST00000287844) based on cDNA AL050291, with putative translation in the shared exon following the coding frame of ABCF2. PhyloCSF indicates that the 122-aa CDS is translated in a different frame, so the translation of ENST00000222388 is potentially spurious. Although the 122-aa CDS is conserved to birds, the locus has apparently been lost in rodents. There is no evidence for transcriptional connectivity between the orthologous Ensembl chicken models ABCF2 and ENSGALG00000013346 (bottom). ENST00000222388 has been reclassified as a “readthrough” transcript, and Intropolis data indicate that such readthrough between human ABCF2 and H2BE1 is rare. (E) TMEM274P is a novel human unitary pseudogene, orthologous to novel mouse protein-coding gene Tmem274. CDS alignments to RefSeq models such as scallop LOC110448246 and trichoplax XP_002113670.1 suggest this gene may predate vertebrate evolution, although orthology is presumptive owing to lack of synteny beyond coelacanth. The gene has at best weak expression data in all species examined, but all but one of the mouse splice junctions are supported by minimal ENCODE RNA-seq data from pooled sources, and all splice sites display mammalian conservation. 
An alignment of human (hum) to chimp (pan), with outgroups mouse (mus) and zebrafish (zeb), shows that human has a premature stop codon that is not a known SNP in the fourth exon of the ancestral CDS (red asterisk in diagram and alignment) and has also lost the second coding exon (large gap in human sequence); both events are unique to human. The zebrafish sequence in the alignment is from XP_017212190, and the chimp translation is from the genome sequence.
Figure 3.
Protein-altering disease variants. (A) Chromosomal positions and strength of association for the 118 SNVs in newly annotated CDSs that were previously found to be significantly associated with diseases or other traits, with the trait abbreviation from Supplemental Data S5 listed for the 40 most significant associations. (B) Novel coding sequence added to human TJP2 locus includes an eye disease–associated variant. Previous GENCODE annotation represented by models ENST00000539225, ENST00000535702, ENST00000377245, and ENST00000348208. Additional transcriptional complexity omitted for clarity. PhyloCSF PCCRs indicated the presence of two additional coding exons (dotted box and inset) that led to annotation of novel coding transcript model ENST00000636438, which lacks cDNA or EST support but whose intron is confidently supported by short read data in Intropolis (blue rectangle) mostly from a retinal study (Farkas et al. 2013), and whose TSS (P1) is supported by FANTOM5 CAGE data, limited to retina and eye (data from ZENBU browser, precisely redrawn for clarity; scores represent sequence read counts, with zeros for the next three experiments included for comparison). In contrast, TSSs P2 and P3 have negligible CAGE support for eye expression, with profiles dominated by monocyte and central nervous system expression. FANTOM5 CAGE also shows eye-specific expression for an equivalent mouse model added as part of this study, also supported by eye-experiment ESTs (e.g., BU505208.1). The second coding exon added to human GENCODE contains GWAS variant rs11145465, identified in a study of refractive error and myopia with a P-value of 7 × 10⁻⁹ (Verhoeven et al. 2013). In that study, the variant had been interpreted as noncoding based on RefSeq annotation, but it can now be reclassified as a missense mutation of an amino acid that is perfectly conserved in the mammal and avian clades. (C) Regional association plot for eye disease. 
All SNPs in an 800-kb window with their strength of association with refractive error and myopia in a more recent study (Tedja et al. 2018) show that rs11145465 has the strongest association. The positions of the novel coding exons of ENST00000636438 have been added in red.
Figure 4.
Potential novel CDSs in other species. Browser images show proposed novel CDSs (cyan) suggested by PCCRs (green/red for ± strand; rank next to region), smoothed PhyloCSF browser tracks, splice site predictions where useful (green donor, red acceptor, height indicating prediction strength), and ATG (green) and stop (red) codons. Supplemental Figure S6 has color-coded alignments for each example. (A) A cluster of three PCCRs in the 5′ UTR of D. melanogaster nudE suggests there is a single-exon novel protein-coding gene or an additional nudE cistron with ORF at positions 9898731–9899168. Although there is no PhyloCSF signal in the first 28 codons, the high frame conservation despite several indels provides ample evidence of purifying selection for protein-coding function. (B) A PCCR just 5′ of an exon of D. melanogaster transcript F of CG33143 suggests that there is a novel coding transcript including an exon 173 nt longer than the annotated exon. This exon includes an in-frame TAG stop codon, suggesting translational stop codon readthrough. We have previously estimated that ∼6% of D. melanogaster genes undergo stop codon readthrough (Jungreis et al. 2016). The stop codon is perfectly conserved and is followed immediately by a cytosine residue, both of which are known correlates of readthrough. (C) A large cluster of PCCRs on the “−” strand of C. elegans Chromosome I suggests there is a 1271-amino-acid single-exon gene with ORF at positions 2054512–2058327. There is no alignment for a few codons on each end of the PhyloCSF signal, so to construct the putative ORF, we have extended the region 5′ to the nearest ATG and 3′ to the nearest stop codon. (D) Three PCCRs within an intron of C. elegans gene WBGene00006792 (unc-58) shown on the “−” strand of Chromosome X suggest alternative start exons for that gene. The coding region of each of these putative exons begins with a perfectly conserved ATG and ends at a perfectly conserved GT having high splice-prediction score. 
All three end with a 1-nt partial codon, which allows them to splice to the next exon of transcript T06H11.1b while preserving the reading frame. (E) A PCCR in A. gambiae suggests that 22539177–22539650 on the “−” strand of Chromosome 2L is protein coding, forming either a novel gene or the first coding exon of the previously incompletely annotated gene AGAP005849. Subsequent curation confirmed the latter. Frame conservation provides strong evidence of coding function in the early portion of the putative transcript where the PhyloCSF signal is weak. (F) A cluster of three PCCRs in an intron of A. gambiae gene AGAP011962 suggests an additional coding exon at positions 35635374–35635874 of Chromosome 3L, confirmed through subsequent curation to be part of a previously missed alternative transcript.
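The ORF construction mentioned in panel (C), extending a region 5′ to the nearest ATG and 3′ to the nearest stop codon, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code. The function names `extend_to_orf` and `protein_length` are hypothetical. The sequence is assumed to be already oriented 5′ to 3′ on the coding strand, the region is assumed to be in frame with no internal stop codon, and "nearest" is assumed to mean nearest in frame.

```python
from typing import Optional, Tuple

STOPS = {"TAA", "TAG", "TGA"}


def extend_to_orf(seq: str, start: int, end: int) -> Optional[Tuple[int, int]]:
    """Extend an in-frame region [start, end) to a complete ORF.

    Walks upstream one codon at a time to the nearest in-frame ATG
    (giving up if a stop codon is met first), then downstream to the
    nearest in-frame stop codon. Returns half-open coordinates that
    include the stop codon, or None if no complete ORF is found.
    """
    if (end - start) % 3 != 0:
        raise ValueError("region length must be a multiple of 3")

    orf_start = None
    pos = start
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon == "ATG":
            orf_start = pos
            break
        if codon in STOPS:
            break  # an upstream stop precedes any ATG: no ORF in this frame
        pos -= 3
    if orf_start is None:
        return None

    pos = end
    while pos + 3 <= len(seq):
        if seq[pos:pos + 3] in STOPS:
            return orf_start, pos + 3  # include the stop codon
        pos += 3
    return None


def protein_length(orf_start: int, orf_end: int) -> int:
    """Number of amino acids encoded by an ORF whose interval includes the stop codon."""
    return (orf_end - orf_start) // 3 - 1


if __name__ == "__main__":
    seq = "CCATGAAACCCGGGTTTTAGCC"
    print(extend_to_orf(seq, 8, 14))
```

As a consistency check on the caption's own numbers, the interval 2054512–2058327 spans 3816 nt, or 1272 codons, which matches a 1271-amino-acid protein if the stated interval includes the stop codon.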
