2015 Jun 24
Gene Model Annotations for Drosophila Melanogaster: Impact of High-Throughput Data
We report the current status of the FlyBase annotated gene set for Drosophila melanogaster and highlight improvements based on high-throughput data. The FlyBase annotated gene set consists entirely of manually annotated gene models, with the exception of some classes of small non-coding RNAs. All gene models have been reviewed using evidence from high-throughput datasets, primarily from the modENCODE project. These datasets include RNA-Seq coverage data, RNA-Seq junction data, transcription start site profiles, and translation stop-codon read-through predictions. New annotation guidelines were developed to take into account the use of the high-throughput data. We describe how this flood of new data was incorporated into thousands of new and revised annotations. FlyBase has adopted a philosophy of excluding low-confidence and low-frequency data from gene model annotations; we also do not attempt to represent all possible permutations for complex and modularly organized genes. This has allowed us to produce a high-confidence, manageable gene annotation dataset that is available at FlyBase (http://flybase.org). Interesting aspects of new annotations include new genes (coding, non-coding, and antisense), many genes with alternative transcripts with very long 3' UTRs (up to 15-18 kb), and a stunning mismatch in the number of male-specific genes (approximately 13% of all annotated gene models) vs. female-specific genes (less than 1%). The number of identified pseudogenes and mutations in the sequenced strain also increased significantly. We discuss remaining challenges, for instance, identification of functional small polypeptides and detection of alternative translation starts.
Keywords: alternative splice; exon junction; lncRNA; transcription start site; transcriptome
Copyright © 2015 Matthews et al.
Figure 1. Changes to the FlyBase gene model annotation set in the era of high-throughput data. The gene model annotation set of FlyBase version FB2010_01 (R5.24), the last version to predate FlyBase incorporation of high-throughput data, was compared to that of FlyBase version FB2014_06 (R6.03) to determine the degree of change over the course of 26 annotation updates. Gene model annotations common to both sets (white) were identified. Gene model annotations specific to R6.03 were then examined to identify those derived from R5.24-specific gene model annotations through gene merge/split or reclassification (yellow). The remaining R5.24-specific annotations were classified as “withdrawn” (red), and the R6.03-specific annotations were classified as “new” (blue). The number of gene models within each category of status change is shown for protein-coding genes (A). For 13,003 of the 13,112 protein-coding genes common to both R5.24 and R6.03 (excluding nine complex cases), we also examined the degree to which the number of associated transcripts changed: a measure of gene model complexity. For these genes, the numbers of associated transcripts that persisted (white), were deleted (red), or were added (blue) at some point between R5.24 and R6.03 are shown (B). Changes to the set of annotated pseudogenes (C) and non-coding RNA genes (D) are also shown.
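The release comparison described in this caption can be read as a set classification. The following is a minimal sketch of that reading, not FlyBase's actual pipeline: the function name `classify_changes`, the `derived_from` mapping, and the gene identifiers are all hypothetical and chosen only for illustration.

```python
def classify_changes(old_ids, new_ids, derived_from):
    """Classify gene models between an old and a new release.

    derived_from (hypothetical input) maps a new-release ID to the
    old-release IDs it came from via merge, split, or reclassification.
    """
    old_ids, new_ids = set(old_ids), set(new_ids)
    common = old_ids & new_ids      # present in both releases
    old_only = old_ids - new_ids    # specific to the old release
    new_only = new_ids - old_ids    # specific to the new release

    derived = set()    # new-release models traced back to old-only models
    consumed = set()   # old-only models that live on in a derived model
    for new_id in new_only:
        parents = set(derived_from.get(new_id, ())) & old_only
        if parents:
            derived.add(new_id)
            consumed |= parents

    return {
        "common": common,
        "derived": derived,
        "withdrawn": old_only - consumed,  # old-only with no descendant
        "new": new_only - derived,         # new-only with no ancestor
    }


# Toy example with made-up IDs: g2 is split into g2a and g2b.
result = classify_changes(
    old_ids={"g1", "g2", "g3", "g4"},
    new_ids={"g1", "g2a", "g2b", "g5"},
    derived_from={"g2a": {"g2"}, "g2b": {"g2"}},
)
for status in ("common", "derived", "withdrawn", "new"):
    print(status, sorted(result[status]))
```

In this toy case the split products count as derived rather than new, and the parent of the split does not count as withdrawn, which matches the order of classification given in the caption.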
Figure 2. Klp54D gene model was split into two genes. A GBrowse1 view of the Klp54D gene model as it existed in R5.30 (A). The gene (blue) and transcript (orange) annotations were based primarily on gene prediction (yellow). On the basis of high-throughput data, this gene model was split in R5.36 to give Klp54D and CG43324, as shown in an updated GBrowse2 view of this same region, as it exists in R6.03 (B). Below the transcript annotations, modENCODE RNA-Seq exon junctions (blue), aligned cDNA evidence (green), and modENCODE RNA-Seq coverage data for 30 developmental stages spanning early embryogenesis to adulthood are shown from top to bottom. The RNA-Seq expression data show that CG43324 is expressed at a much higher level and in more stages than Klp54D. There is also no RNA-Seq exon junction connecting the two genes. In addition, the annotated 5′ end of CG43324 is supported by RAMPAGE TSS data (not shown). More information on data presented in GBrowse may be found at http://flybase.org/wiki/FlyBase:GBrowse_Tracks.
Figure 3. Alternative transcription start site and 3′ end for CG31717. A GBrowse2 view of CG31717, as it exists in R6.03, depicting (from top to bottom) modENCODE embryonic transcription start site evidence, FlyBase gene and transcript annotations, aligned cDNA evidence, modENCODE RNA-Seq junctions, and modENCODE stranded RNA-Seq expression profiles for CNS tissues (larval, pupal, and adult head samples) and gonadal tissues (testis, accessory gland, virgin female ovary, and mated female ovary); plus strand signal is shown above the minus strand signal for each RNA-Seq track. More information on data presented in GBrowse may be found at http://flybase.org/wiki/FlyBase:GBrowse_Tracks.
Figure 4. New long non-coding RNA genes are supported by RNA-Seq data. A GBrowse2 view for a region containing four recently annotated lncRNA genes is shown (R6.03). CR43132 is supported by RNA-Seq junction and expression data. CR45523, CR45524, and CR45526 are supported by RNA-Seq expression data only; they were identified in a genome-wide scan for intergenic regions with RPKM values of 3 or more. The transcript polarity is determined from the stranded “Gonads and male accessory glands” RNA-Seq expression tracks. CR45523, CR45524, and CR45526 show expression primarily in male testis (red RNA-Seq signal), a pattern common to many of the newly annotated ncRNA genes. See Figure 2 and Figure 3 for GBrowse track descriptions. More information on data presented in GBrowse may be found at http://flybase.org/wiki/FlyBase:GBrowse_Tracks.
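The caption gives only the threshold (RPKM of 3 or more) and does not say how the scan was implemented. As a rough illustration, the sketch below applies the standard RPKM definition (reads per kilobase of region per million mapped reads) to candidate intergenic regions. The function names, region labels, and read counts are hypothetical.

```python
def rpkm(read_count, region_length_bp, total_mapped_reads):
    """Reads per kilobase of region per million mapped reads."""
    return read_count * 1e9 / (region_length_bp * total_mapped_reads)


def regions_above_threshold(regions, total_mapped_reads, threshold=3.0):
    """Return (name, RPKM) for regions whose RPKM meets the threshold.

    regions is a list of (name, read_count, length_bp) tuples; the
    default threshold of 3 follows the value stated in the caption.
    """
    hits = []
    for name, read_count, length_bp in regions:
        value = rpkm(read_count, length_bp, total_mapped_reads)
        if value >= threshold:
            hits.append((name, value))
    return hits


# Made-up intergenic regions and library size, for illustration only.
candidates = [
    ("region_A", 150, 2000),  # 150 reads over 2 kb
    ("region_B", 40, 1000),   # 40 reads over 1 kb
]
print(regions_above_threshold(candidates, total_mapped_reads=20_000_000))
```

With a library of 20 million mapped reads, region_A scores 3.75 and passes, while region_B scores 2.0 and is dropped.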
Figure 5. New ncRNA gene CR45161 is antisense to fln. CR45161 is a newly annotated antisense gene supported by RNA-Seq expression and junction data. Although it might be mistaken for background transcription in the unstranded “Developmental stage” RNA-Seq expression tracks, its strong transcription on the positive strand is obvious in the stranded “CNS and adult head” RNA-Seq track. See Figure 2 and Figure 3 for GBrowse track descriptions. More information on data presented in GBrowse may be found at http://flybase.org/wiki/FlyBase:GBrowse_Tracks.
Figure 6. A subset of possible AnxB9 transcript isoforms has been annotated. RNA-Seq junction and expression data predict eight alternative splice donors from three different leading 5′ exons, of which four have been used in annotations. Low-frequency junctions have not been annotated. Alternative splicing in the last intron leads to three different protein isoforms. A low-frequency junction at the 3′ end of the gene has also been excluded. Twelve different transcript isoforms are possible using the annotated junctions (32 are possible with all junctions), but only a subset of the possible combinations has been annotated. See Figure 2 and Figure 3 for GBrowse track descriptions. More information on data presented in GBrowse may be found at http://flybase.org/wiki/FlyBase:GBrowse_Tracks.
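The caption does not spell out how the counts of 12 and 32 arise. One reading that reproduces both numbers, offered here as an assumption, treats the 5′ donor choice and the 3′ splicing choice as independent: 4 annotated donors × 3 last-intron alternatives = 12, and 8 donors × 4 alternatives (the 3 annotated plus the excluded low-frequency 3′ junction) = 32. The labels below are placeholders, not real junction names.

```python
from itertools import product


def count_isoforms(five_prime_choices, three_prime_choices):
    """Count combinations of one 5' choice with one 3' choice."""
    return sum(1 for _ in product(five_prime_choices, three_prime_choices))


# Placeholder labels; the caption gives only the counts.
annotated_donors = [f"donor_{i}" for i in range(1, 5)]         # 4 annotated
all_donors = [f"donor_{i}" for i in range(1, 9)]               # 8 predicted
annotated_last_intron = ["alt_1", "alt_2", "alt_3"]            # 3 annotated
all_three_prime = annotated_last_intron + ["low_freq_3prime"]  # plus 1 excluded

print(count_isoforms(annotated_donors, annotated_last_intron))  # 12
print(count_isoforms(all_donors, all_three_prime))              # 32
```

Under this reading, excluding low-frequency junctions cuts the combinatorial space from 32 to 12 before any choice is made about which combinations to annotate.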
Figure 7. The two nonoverlapping protein isoforms of klar. A GBrowse2 view of klar is shown, as it exists in R6.03, with nonoverlapping isoforms highlighted in yellow (klar-RC and -RI do not overlap klar-RD and -RH). The C-terminus of the longer, "upstream" isoforms (klar-RD and -RH) is sufficient for targeting proteins to lipid droplets, whereas the "KASH" domain present in the "downstream" isoforms (klar-RC and -RI) is sufficient for targeting to the nuclear envelope (Guo et al. 2005). The "upstream" nonoverlapping isoform is necessary for proper lipid droplet targeting in the embryo. While the KASH domain is necessary for nuclear migration in the embryo and retina, this function is associated with the "full-length" KASH-containing isoforms. The short KASH-containing isoform, which lacks motor interaction domains, is expressed (Western blot, immunofluorescence) and is apparently enriched in nurse cells but is not sufficient to rescue nuclear migration in the retina. See Figure 2 and Figure 3 for GBrowse track descriptions. More information on data presented in GBrowse may be found at http://flybase.org/wiki/FlyBase:GBrowse_Tracks.