Genome Res. 2012 Oct;22(10):2079-87.
doi: 10.1101/gr.139170.112. Epub 2012 Jun 18.

Gene structure in the sea urchin Strongylocentrotus purpuratus based on transcriptome analysis

Qiang Tu et al. Genome Res. 2012 Oct.

Abstract

A comprehensive transcriptome analysis has been performed on protein-coding RNAs of Strongylocentrotus purpuratus, including 10 different embryonic stages, six feeding larval and metamorphosed juvenile stages, and six adult tissues. In this study, we pooled the transcriptomes from all of these sources and focused on the insights they provide for gene structure in the genome of this recently sequenced model system. The genome had initially been annotated by use of computational gene model prediction algorithms. A large fraction of these predicted genes were recovered in the transcriptome when the reads were mapped to the genome and appropriately filtered and analyzed. However, in a manually curated subset, we discovered that more than half the computational gene model predictions were imperfect, containing errors such as missing exons, prediction of nonexistent exons, erroneous intron/exon boundaries, fusion of adjacent genes, and prediction of multiple genes from single genes. The transcriptome data have been used to provide a systematic upgrade of the gene model predictions throughout the genome, very greatly improving the research usability of the genomic sequence. We have constructed new public databases that incorporate information from the transcriptome analyses. The transcript-based gene model data were used to define average structural parameters for S. purpuratus protein-coding genes. In addition, we constructed a custom sea urchin gene ontology, and assigned about 7000 different annotated transcripts to 24 functional classes. Strong correlations became evident between given functional ontology classes and structural properties, including gene size, exon number, and exon and intron size.
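The structural parameters mentioned in the abstract (gene size, exon number, exon and intron size) can be derived from exon coordinates alone. The following is a minimal sketch of that arithmetic, not the authors' pipeline: the function name `gene_structure`, the 1-based inclusive coordinate convention, and the example coordinates are all assumptions made for illustration.

```python
from statistics import mean


def gene_structure(exons):
    """Summarize one gene model from its exon coordinates.

    `exons` is a list of (start, end) pairs on a single transcript,
    assumed here to be 1-based and inclusive (a hypothetical convention).
    """
    exons = sorted(exons)
    exon_lengths = [end - start + 1 for start, end in exons]
    # An intron is the gap between the end of one exon and the start of the next.
    intron_lengths = [
        next_start - prev_end - 1
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
    ]
    return {
        "gene_length": exons[-1][1] - exons[0][0] + 1,
        "exon_number": len(exons),
        "mean_exon_length": mean(exon_lengths),
        # Single-exon genes have no introns.
        "mean_intron_length": mean(intron_lengths) if intron_lengths else 0.0,
    }


# Made-up three-exon gene model, for illustration only.
example = gene_structure([(101, 200), (501, 650), (1001, 1100)])
print(example)
```

Averaging these per-gene values within each functional class would give class-level summaries of the kind shown in Figure 6, assuming one representative transcript per gene.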

Figures

Figure 1.
Computational simulation of quantitative variations at different sequencing depths. The ordinate, ratio of FPKM per transcript species in the two data sets compared, is given in log2; the abscissa, mean of the two FPKM values, in log10. (Blue dots) 20 million (M) reads; (green dots) 2M reads; (red dots) 0.2M reads. (Vertical dashed line) Average FPKM of 5; (horizontal dotted lines) ± twofold change. The plot shows that in the 20M read data set, prevalence estimations for almost all mRNAs over FPKM 5 are within twofold.
Figure 2.
Length distributions of protein-coding genes and their components. Essentially these plots are smoothed versions of a histogram where the ordinate represents the frequency of the given length in base pairs. All distributions have very long tails, and the plots only show part of the distributions: (A) genes, 0–100 kb; (B) introns and mRNA, 0–10 kb; (C) UTRs and CDS, 0–5 kb; (D) exons, 0–1 kb.
Figure 3.
Lengths of exons and introns with respect to their relative positions in genes. (A) Labeling method for introns and exons used in the following panels. (B,C) Average length of exons and introns diagrammed in A. (D,E) Average length of each exon and intron in all genes containing 10 exons.
Figure 4.
Discrepant predicted and observed gene structure displayed in the IGV genome browser. A selectable variety of aligned features is shown in horizontal tracks with the feature label to the left: Repeat sequences (gray; shows the number of matches using 76-bp sequence windows in the whole genome, using Bowtie with the same parameters as when mapping the reads); Gap (gray; sequence regions of the genome assembly that lie in gaps and are therefore undetermined; several short gaps are shown in A); GLEAN model (red; the original gene model predicted by the GLEAN method); RNA-seq gene models (blue; the models produced by this study; the blank terminal regions are UTRs); Coverage (green; a graphical presentation of the number of sequencing reads that align at a particular location); Reads (gray; the alignment of individual reads to the genome sequence). (Orange arrows) Individual RNA sequence-derived exons. (A) The genomic structure of the gene blimp1. The overall structure of the GLEAN gene model is correct except longer UTRs are recovered and an alternatively spliced isoform that uses a distant 5′ exon is recovered. (B) The genomic structure of the gene hnf6. The GLEAN model predicted an incorrect exon1/intron1 boundary, and the 3′ exon is not supported by sequence. The correct 3′ exons and two isoforms were identified from the RNA sequence data.
Figure 5.
Numbers of gene models associated with major functional classes. The distribution is based on the custom sea urchin ontology discussed in the text.
Figure 6.
Gene structure parameters for individual ontological classes. The four panels show average gene length, exon length, intron length, and exon number. (Black horizontal lines) The average value of the feature in the whole gene set. The “Unclassified” class refers to gene models that were not included in these ontological classes. The “Novel” class refers to gene models newly identified in this study as described in the text; these tend to be atypically small genes with few exons.
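The axes described in the Figure 1 caption (log2 ratio of two FPKM estimates against the log10 of their mean) and the twofold criterion can be sketched as follows. This is an illustrative calculation under stated assumptions, not the simulation used in the paper: the function names `ma_point` and `fraction_within_twofold` and the sample values are hypothetical, and both FPKM values in a pair are assumed to be strictly positive.

```python
import math


def ma_point(fpkm_a, fpkm_b):
    """Return (log2 ratio, log10 mean) for one transcript measured twice.

    Assumes both FPKM values are > 0; zeros would need a pseudocount,
    which is not handled in this sketch.
    """
    ratio_log2 = math.log2(fpkm_a / fpkm_b)
    mean_log10 = math.log10((fpkm_a + fpkm_b) / 2)
    return ratio_log2, mean_log10


def fraction_within_twofold(pairs, min_mean_fpkm=5.0):
    """Fraction of transcripts at or above the mean-FPKM threshold
    whose two estimates differ by no more than twofold."""
    kept = [(a, b) for a, b in pairs if (a + b) / 2 >= min_mean_fpkm]
    if not kept:
        return None
    # |log2 ratio| <= 1 is the same as a change of at most twofold.
    within = sum(1 for a, b in kept if abs(math.log2(a / b)) <= 1.0)
    return within / len(kept)


# Made-up FPKM pairs, for illustration only.
pairs = [(10.0, 12.0), (8.0, 20.0), (100.0, 90.0), (1.0, 4.0)]
fraction = fraction_within_twofold(pairs)
print(ma_point(8.0, 2.0), fraction)
```

In this toy input the last pair falls below the mean FPKM threshold of 5 and is excluded, mirroring the vertical dashed line in the figure.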
