The most widely appreciated role of DNA is to encode protein, yet the exact portion of the human genome that is translated remains to be ascertained. We previously developed PhyloCSF, a widely used tool to identify evolutionary signatures of protein-coding regions using multispecies genome alignments. Here, we present the first whole-genome PhyloCSF prediction tracks for human, mouse, chicken, fly, worm, and mosquito. We develop a workflow that uses machine learning to predict novel conserved protein-coding regions and efficiently guide their manual curation. We analyze more than 1000 high-scoring human PhyloCSF regions and confidently add 144 conserved protein-coding genes to the GENCODE gene set, as well as additional coding regions within 236 previously annotated protein-coding genes, and 169 pseudogenes, most of them disabled after primates diverged. The majority of these represent new discoveries, including 70 previously undetected protein-coding genes. The novel coding genes are additionally supported by single-nucleotide variant evidence indicative of continued purifying selection in the human lineage, coding-exon splicing evidence from new GENCODE transcripts using next-generation transcriptomic data sets, and mass spectrometry evidence of translation for several new genes. Our discoveries required simultaneous comparative annotation of other vertebrate genomes, which we show is essential to remove spurious ORFs and to distinguish coding from pseudogene regions. Our new coding regions help elucidate disease-associated regions by revealing that 118 GWAS variants previously thought to be noncoding are in fact protein altering. Altogether, our PhyloCSF data sets and algorithms will help researchers seeking to interpret these genomes, while our new annotations present exciting loci for further experimental characterization.
© 2019 Mudge et al.; Published by Cold Spring Harbor Laboratory Press.