G3 (Bethesda), 5 (8), 1721-36

Gene Model Annotations for Drosophila melanogaster: Impact of High-Throughput Data

Beverley B Matthews et al. G3 (Bethesda).

Abstract

We report the current status of the FlyBase annotated gene set for Drosophila melanogaster and highlight improvements based on high-throughput data. The FlyBase annotated gene set consists entirely of manually annotated gene models, with the exception of some classes of small non-coding RNAs. All gene models have been reviewed using evidence from high-throughput datasets, primarily from the modENCODE project. These datasets include RNA-Seq coverage data, RNA-Seq junction data, transcription start site profiles, and translation stop-codon read-through predictions. New annotation guidelines were developed to take into account the use of the high-throughput data. We describe how this flood of new data was incorporated into thousands of new and revised annotations. FlyBase has adopted a philosophy of excluding low-confidence and low-frequency data from gene model annotations; we also do not attempt to represent all possible permutations for complex and modularly organized genes. This has allowed us to produce a high-confidence, manageable gene annotation dataset that is available at FlyBase (http://flybase.org). Interesting aspects of new annotations include new genes (coding, non-coding, and antisense), many genes with alternative transcripts with very long 3' UTRs (up to 15-18 kb), and a stunning mismatch in the number of male-specific genes (approximately 13% of all annotated gene models) vs. female-specific genes (less than 1%). The number of identified pseudogenes and mutations in the sequenced strain also increased significantly. We discuss remaining challenges, for instance, identification of functional small polypeptides and detection of alternative translation starts.

Keywords: alternative splice; exon junction; lncRNA; transcription start site; transcriptome.

Figures

Figure 1
Changes to the FlyBase gene model annotation set in the era of high-throughput data. The gene model annotation set of FlyBase version FB2010_01 (R5.24), the last version to predate FlyBase incorporation of high-throughput data, was compared to that of FlyBase version FB2014_06 (R6.03) to determine the degree of change over the course of 26 annotation updates. Gene model annotations common to both sets (white) were identified. Gene model annotations specific to R6.03 were then examined to identify those derived from R5.24-specific gene model annotations through gene merge/split or reclassification (yellow). The remaining R5.24-specific annotations were classified as “withdrawn” (red), and the R6.03-specific annotations were classified as “new” (blue). The number of gene models within each category of status change is shown for protein-coding genes (A). For 13,003 of the 13,112 protein-coding genes common to both R5.24 and R6.03 (excluding nine complex cases), we also examined the degree to which the number of associated transcripts changed: a measure of gene model complexity. For these genes, the numbers of associated transcripts that persisted (white), were deleted (red), or were added (blue) at some point between R5.24 and R6.03 are shown (B). Changes to the set of annotated pseudogenes (C) and non-coding RNA genes (D) are also shown.
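The four-way classification described in this caption can be sketched as set operations over the two releases. This is only an illustration: the function name, the gene identifiers, and the `derived_from` mapping (new gene ID to the old gene IDs it came from through a merge, split, or reclassification) are hypothetical and are not taken from FlyBase.

```python
def classify_annotations(old_ids, new_ids, derived_from):
    """Sort gene models into common / derived / withdrawn / new.

    old_ids, new_ids: iterables of gene IDs in the older and newer release.
    derived_from: hypothetical mapping of a newer-release ID to the set of
    older-release IDs it was derived from (merge, split, reclassification).
    """
    old_ids, new_ids = set(old_ids), set(new_ids)
    common = old_ids & new_ids
    old_only = old_ids - new_ids
    new_only = new_ids - old_ids

    derived = set()        # newer-release models traceable to an old-only model
    accounted_old = set()  # old-only models explained by such a derivation
    for new_id in new_only:
        parents = set(derived_from.get(new_id, ())) & old_only
        if parents:
            derived.add(new_id)
            accounted_old |= parents

    return {
        "common": common,
        "derived": derived,
        "withdrawn": old_only - accounted_old,
        "new": new_only - derived,
    }


# Toy data: geneB is split into geneB1 and geneB2; geneC and geneD disappear.
old_release = {"geneA", "geneB", "geneC", "geneD"}
new_release = {"geneA", "geneB1", "geneB2", "geneE"}
history = {"geneB1": {"geneB"}, "geneB2": {"geneB"}}

result = classify_annotations(old_release, new_release, history)
for category in ("common", "derived", "withdrawn", "new"):
    print(category, sorted(result[category]))
```

In this toy example, geneB is not counted as withdrawn because its two split products account for it, which matches the caption's "remaining R5.24-specific annotations" wording.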
Figure 2
The Klp54D gene model was split into two genes. A GBrowse1 view of the Klp54D gene model as it existed in R5.30 (A). The gene (blue) and transcript (orange) annotations were based primarily on gene prediction (yellow). On the basis of high-throughput data, this gene model was split in R5.36 to give Klp54D and CG43324, as shown in an updated GBrowse2 view of this same region, as it exists in R6.03 (B). Below the transcript annotations, modENCODE RNA-Seq exon junctions (blue), aligned cDNA evidence (green), and modENCODE RNA-Seq coverage data for 30 developmental stages spanning early embryogenesis to adulthood are shown from top to bottom. The RNA-Seq expression data show that CG43324 is expressed at a much higher level and in more stages than Klp54D. There is also no RNA-Seq exon junction connecting the two genes. In addition, the annotated 5′ end of CG43324 is supported by RAMPAGE TSS data (not shown). More information on data presented in GBrowse may be found at http://flybase.org/wiki/FlyBase:GBrowse_Tracks.
Figure 3
Alternative transcription start site and 3′ end for CG31717. A GBrowse2 view of CG31717, as it exists in R6.03, depicting (from top to bottom) modENCODE embryonic transcription start site evidence, FlyBase gene and transcript annotations, aligned cDNA evidence, modENCODE RNA-Seq junctions, and modENCODE stranded RNA-Seq expression profiles for CNS tissues (larval, pupal, and adult head samples) and gonadal tissues (testis, accessory gland, virgin female ovary, and mated female ovary); plus strand signal is shown above the minus strand signal for each RNA-Seq track. More information on data presented in GBrowse may be found at http://flybase.org/wiki/FlyBase:GBrowse_Tracks.
Figure 4
New long non-coding RNA genes are supported by RNA-Seq data. A GBrowse2 view for a region containing four recently annotated lncRNA genes is shown (R6.03). CR43132 is supported by RNA-Seq junction and expression data. CR45523, CR45524, and CR45526 are supported by RNA-Seq expression data only; they were identified in a genome-wide scan for intergenic regions with RPKM values of 3 or more. The transcript polarity is determined from the stranded “Gonads and male accessory glands” RNA-Seq expression tracks. CR45523, CR45524, and CR45526 show expression primarily in male testis (red RNA-Seq signal), a pattern common to many of the newly annotated ncRNA genes. See Figure 2 and Figure 3 for GBrowse track descriptions. More information on data presented in GBrowse may be found at http://flybase.org/wiki/FlyBase:GBrowse_Tracks.
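The RPKM threshold mentioned in this caption can be illustrated with the standard definition, reads per kilobase of region per million mapped reads. The sketch below is not the scan FlyBase ran; the region names, read counts, lengths, and library size are invented for illustration, and only the cutoff of 3 comes from the caption.

```python
def rpkm(read_count, region_length_bp, total_mapped_reads):
    """Reads per kilobase of region per million mapped reads."""
    return read_count * 1e9 / (region_length_bp * total_mapped_reads)


def candidate_regions(regions, total_mapped_reads, threshold=3.0):
    """Return names of regions whose RPKM meets or exceeds the threshold.

    regions: iterable of (name, read_count, length_bp) tuples (hypothetical).
    """
    return [
        name
        for name, read_count, length_bp in regions
        if rpkm(read_count, length_bp, total_mapped_reads) >= threshold
    ]


# Invented intergenic regions and library size, for illustration only.
total_reads = 20_000_000
intergenic = [
    ("region_1", 150, 2000),  # 150e9 / (2000 * 2e7) = 3.75
    ("region_2", 40, 1000),   # 40e9 / (1000 * 2e7) = 2.0
]

passing = candidate_regions(intergenic, total_reads)
print(passing)
```

Normalizing by both region length and library size is what lets a single cutoff be applied across regions of different sizes and samples of different sequencing depth.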
Figure 5
New ncRNA gene CR45161 is antisense to fln. CR45161 is a newly annotated antisense gene supported by RNA-Seq expression and junction data. Although it might be mistaken for background transcription in the unstranded “Developmental stage” RNA-Seq expression tracks, its strong transcription on the positive strand is obvious in the stranded “CNS and adult head” RNA-Seq track. See Figure 2 and Figure 3 for GBrowse track descriptions. More information on data presented in GBrowse may be found at http://flybase.org/wiki/FlyBase:GBrowse_Tracks.
Figure 6
A subset of possible AnxB9 transcript isoforms has been annotated. RNA-Seq junction and expression data predict eight alternative splice donors from three different leading 5′ exons, of which four have been used in annotations. Low-frequency junctions have not been annotated. Alternative splicing in the last intron leads to three different protein isoforms. A low-frequency junction at the 3′ end of the gene has also been excluded. Twelve different transcript isoforms are possible using the annotated junctions (32 are possible with all junctions), but only a subset of the possible combinations has been annotated. See Figure 2 and Figure 3 for GBrowse track descriptions. More information on data presented in GBrowse may be found at http://flybase.org/wiki/FlyBase:GBrowse_Tracks.
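The isoform counts in this caption can be reproduced by simple enumeration under one reading of the text: each transcript pairs one 5′ splice donor with one 3′ alternative. Treating the excluded low-frequency 3′ junction as a fourth 3′ alternative is an assumption made here because it yields the caption's total of 32. The labels below are hypothetical; the caption gives only the counts.

```python
from itertools import product

# Hypothetical labels standing in for the junctions described in the caption.
annotated_donors = [f"donor_{i}" for i in range(1, 5)]  # 4 annotated donors
all_donors = [f"donor_{i}" for i in range(1, 9)]        # 8 predicted donors

# Three alternatives in the last intron give three protein isoforms.
annotated_3prime = ["last_intron_a", "last_intron_b", "last_intron_c"]
# Assumption: the excluded low-frequency 3' junction acts as a fourth choice.
all_3prime = annotated_3prime + ["low_frequency_3prime"]

annotated_combinations = list(product(annotated_donors, annotated_3prime))
all_combinations = list(product(all_donors, all_3prime))

print(len(annotated_combinations))  # 4 x 3
print(len(all_combinations))        # 8 x 4
```

The counts multiply with each independent choice point, so representing every permutation quickly becomes unmanageable, which is the motivation the abstract gives for annotating only a subset.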
Figure 7
The two nonoverlapping protein isoforms of klar. A GBrowse2 view of klar is shown, as it exists in R6.03, with nonoverlapping isoforms highlighted in yellow (klar-RC and -RI do not overlap klar-RD and -RH). The C-terminus of the longer, "upstream" isoforms (klar-RD and -RH) is sufficient for targeting proteins to lipid droplets, whereas the "KASH" domain present in the "downstream" isoforms (klar-RC and -RI) is sufficient for targeting to the nuclear envelope (Guo et al. 2005). The "upstream" nonoverlapping isoform is necessary for proper lipid droplet targeting in the embryo. While the KASH domain is necessary for nuclear migration in the embryo and retina, this function is associated with the "full-length" KASH-containing isoforms. The short KASH-containing isoform, which lacks motor interaction domains, is expressed (Western blot, immunofluorescence) and is apparently enriched in nurse cells but is not sufficient to rescue nuclear migration in the retina. See Figure 2 and Figure 3 for GBrowse track descriptions. More information on data presented in GBrowse may be found at http://flybase.org/wiki/FlyBase:GBrowse_Tracks.

References

Aminetzach Y. T., Macpherson J. M., Petrov D. A., 2005. Pesticide resistance via transposition-mediated adaptive gene truncation in Drosophila. Science 309: 764–767.
Bachmann A., Draga M., Grawe F., Knust E., 2008. On the role of the MAGUK proteins encoded by Drosophila varicose during embryonic and postembryonic development. BMC Dev. Biol. 8: 55.
Balakirev E. S., Ayala F. J., 2004. The β-esterase gene cluster of Drosophila melanogaster: is ψEst-6 a pseudogene, a functional gene, or both? Genetica 121: 165–179.
Batut P., Dobin A., Plessy C., Carninci P., Gingeras T. R., 2013. High-fidelity promoter profiling reveals widespread alternative promoter usage and transposon-driven developmental gene expression. Genome Res. 23: 169–180.
Behm-Ansmant I., Kashima I., Rehwinkel J., Sauliere J., Wittkopp N., et al., 2007. mRNA quality control: an ancient machinery recognizes and degrades mRNAs with nonsense codons. FEBS Lett. 581: 2845–2853.
