Genome Res. 2012 Mar;22(3):577-91.
doi: 10.1101/gr.133009.111. Epub 2011 Nov 22.

Systematic Identification of Long Noncoding RNAs Expressed During Zebrafish Embryogenesis

Free PMC article

Andrea Pauli et al. Genome Res. 2012.

Abstract

Long noncoding RNAs (lncRNAs) comprise a diverse class of transcripts that structurally resemble mRNAs but do not encode proteins. Recent genome-wide studies in humans and the mouse have annotated lncRNAs expressed in cell lines and adult tissues, but a systematic analysis of lncRNAs expressed during vertebrate embryogenesis has been elusive. To identify lncRNAs with potential functions in vertebrate embryogenesis, we performed a time-series of RNA-seq experiments at eight stages during early zebrafish development. We reconstructed 56,535 high-confidence transcripts in 28,912 loci, recovering the vast majority of expressed RefSeq transcripts while identifying thousands of novel isoforms and expressed loci. We defined a stringent set of 1133 noncoding multi-exonic transcripts expressed during embryogenesis. These include long intergenic ncRNAs (lincRNAs), intronic overlapping lncRNAs, exonic antisense overlapping lncRNAs, and precursors for small RNAs (sRNAs). Zebrafish lncRNAs share many of the characteristics of their mammalian counterparts: relatively short length, low exon number, low expression, and conservation levels comparable to that of introns. Subsets of lncRNAs carry chromatin signatures characteristic of genes with developmental functions. The temporal expression profile of lncRNAs revealed two novel properties: lncRNAs are expressed in narrower time windows than are protein-coding genes and are specifically enriched in early-stage embryos. In addition, several lncRNAs show tissue-specific expression and distinct subcellular localization patterns. Integrative computational analyses associated individual lncRNAs with specific pathways and functions, ranging from cell cycle regulation to morphogenesis. Our study provides the first systematic identification of lncRNAs in a vertebrate embryo and forms the foundation for future genetic, genomic, and evolutionary studies.

Figures

Figure 1.
Overview of the RNA-seq–based embryonic transcriptome assembly. (A) Overview of the RNA-seq–based transcript reconstruction pipeline that was employed to identify embryonically expressed transcripts in zebrafish. Stage-specific transcriptomes were reconstructed from a time-series of eight embryonic stages: two to four cell, 1000 cell, dome, shield, bud, 28 h post fertilization (hpf), 48 hpf, and 120 hpf. Stage-specific drawings of representative embryos are adapted from Kimmel et al. (1995) (with permission from Wiley © 1995). A schematic outline of the process of transcriptome reconstruction is shown at the bottom for three genes. Reads were mapped to either the + (blue) or – (red) strand using TopHat. Gaps inferred from mapping each of the two paired-end reads are indicated as dashed gray lines; dashed black arrows indicate splice-junctions inferred from a gap in mapping of a single read; and the deduced final transcript structures reconstructed by Scripture or Cufflinks are depicted at the bottom. (B) Overlap between loci from the RNA-seq–based embryonic transcriptome assembly (blue) and previously annotated genes (gray): RefSeq genes (left) and Ensembl loci >160 bp (right). The majority of known loci (84% of RefSeq loci and 74% of Ensembl loci >160 bp) are recovered in the embryonic transcriptome. Note that the number of loci in the Ensembl transcriptome is based on comparison with loci of the embryonic transcriptome (which were used as reference), which reduced the number of 27,751 Ensembl loci (>160 bp) to 26,587.
Figure 2.
Overview of the stringent filtering pipeline that defined a conservative set of 1,133 lncRNAs. (A) Filters at a glance: overview of classification criteria used to define noncoding transcripts. (B) Detailed outline of the filtering pipeline that defined a conservative set of 1133 multi-exonic, embryonically expressed lncRNAs. The following filtering criteria were used: (1) Phylogenetic Codon Substitution Frequency (PhyloCSF) score <20 (left branch of the top node) or rescue by the antisense pipeline (right branch of the top node [dashed lines]: PhyloCSFsense < 300 and PhyloCSFsense < PhyloCSFanti and highest scoring region [HSR] overlapping with an exon on the opposite strand); (2) no known protein homologs based on blastx, blastp, and HMMER; (3) maximal ORF (ORFmax) <100 aa (transcripts with alignments [complete branch length (CBL) > 0]) or <30 aa (transcripts without alignments [CBL = 0]); and (4) no sense-overlap with any protein-coding transcript. At each step, a green arrow denotes the transcripts that passed the filter; a red arrow, those that were removed. Black bold numbers indicate the number of transcripts that passed the filter. Blue boxes highlight the number of transcripts that passed all filters and are considered noncoding (1133 lncRNAs in 859 loci).
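The four filtering criteria in this caption can be read as a chain of boolean tests. The following is a minimal sketch of that logic in Python, not the authors' implementation: the `Transcript` record and all of its field names are hypothetical, and the scores and homology results are assumed to have been computed upstream (for example by PhyloCSF, blastx, blastp, and HMMER).

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    # Hypothetical record; field names are illustrative assumptions that
    # mirror the quantities named in the Figure 2 caption.
    phylocsf_sense: float             # PhyloCSF score on the transcript's strand
    phylocsf_anti: float              # PhyloCSF score on the opposite strand
    hsr_overlaps_opposite_exon: bool  # HSR overlaps an exon on the opposite strand
    has_protein_homolog: bool         # any hit from blastx, blastp, or HMMER
    orf_max_aa: int                   # longest ORF in amino acids
    cbl: float                        # complete branch length (0 = no alignments)
    sense_overlaps_coding: bool       # sense overlap with a protein-coding transcript


def passes_coding_potential(t: Transcript) -> bool:
    """Criterion 1: PhyloCSF < 20, or rescue by the antisense pipeline."""
    if t.phylocsf_sense < 20:
        return True
    return (
        t.phylocsf_sense < 300
        and t.phylocsf_sense < t.phylocsf_anti
        and t.hsr_overlaps_opposite_exon
    )


def is_lncrna(t: Transcript) -> bool:
    """Apply criteria 1 to 4 from the caption in order."""
    # Criterion 3: the ORF limit depends on whether alignments exist.
    orf_limit_aa = 100 if t.cbl > 0 else 30
    return (
        passes_coding_potential(t)
        and not t.has_protein_homolog
        and t.orf_max_aa < orf_limit_aa
        and not t.sense_overlaps_coding
    )
```

Under these assumptions, a transcript with a PhyloCSF score of 150 is kept only if the antisense rescue conditions all hold, whereas a transcript scoring below 20 skips the rescue test and is judged on the remaining three criteria.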
Figure 3.
Classification of lncRNAs. Numbers of lncRNAs in each of the three main classes, as defined by their genomic location relative to neighboring or overlapping genes. Intergenic lncRNAs (blue; lincRNAs) have no overlap with any gene. lncRNAs with intronic overlap (green) are defined as loci that have overlap with another transcribed locus but no exon–exon overlap (no overlap between the mature lncRNA transcript with exons of the overlapping locus). They are on either the same or the opposite strand relative to the overlapping gene and can be partitioned into intronic contained lncRNAs (incs, light green; the lncRNA is contained within the transcribed region of another locus), completely overlapping lncRNAs (concs, green; the other locus is contained within the transcribed region of the lncRNA locus), and partially overlapping lncRNAs (poncs, dark green; neither inc nor conc, but at least one exon of the lncRNA has overlap with an intron of another locus). LncRNAs with antisense exonic overlap (red) have at least one exon that overlaps with an exon of a protein-coding transcript on the opposite strand; they can be partitioned into those identified via the general pipeline (PhyloCSF < 20, light red) and those rescued via the antisense pipeline (20 < PhyloCSF < 300, dark red). A scheme of the position of the lncRNA gene (in color) relative to neighboring or overlapping gene(s) (black) is shown at the bottom.
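The class definitions in this caption amount to interval tests on exons and transcribed regions. The sketch below illustrates one way to express them in Python. It is a simplification under stated assumptions: the `Locus` type is hypothetical, coordinates are half-open intervals on a single chromosome, every entry in `genes` is treated as protein-coding, and any intronic overlap that is neither "inc" nor "conc" is labeled "ponc" without re-checking the exon-intron condition.

```python
from typing import List, NamedTuple, Tuple

Interval = Tuple[int, int]  # half-open [start, end), an assumption of this sketch


class Locus(NamedTuple):
    # Hypothetical container for one locus on a single chromosome.
    strand: str
    exons: List[Interval]


def span(locus: Locus) -> Interval:
    """Transcribed region: from the first exon start to the last exon end."""
    return (min(s for s, _ in locus.exons), max(e for _, e in locus.exons))


def overlaps(a: Interval, b: Interval) -> bool:
    return a[0] < b[1] and b[0] < a[1]


def contains(outer: Interval, inner: Interval) -> bool:
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def classify(lnc: Locus, genes: List[Locus]) -> str:
    lnc_span = span(lnc)
    hits = [g for g in genes if overlaps(lnc_span, span(g))]
    if not hits:
        return "lincRNA"  # intergenic: no overlap with any gene
    for g in hits:
        if any(overlaps(a, b) for a in lnc.exons for b in g.exons):
            if g.strand != lnc.strand:
                return "exonic antisense"
            # Sense exonic overlap is removed by the filtering pipeline.
            return "sense exonic (filtered out)"
    # Overlap with another locus, but no exon-exon overlap.
    if any(contains(span(g), lnc_span) for g in hits):
        return "inc"   # lncRNA contained within another locus
    if any(contains(lnc_span, span(g)) for g in hits):
        return "conc"  # another locus contained within the lncRNA
    return "ponc"      # partial overlap, neither inc nor conc
```

For example, an lncRNA whose exons all fall inside a single intron of a gene is reported as "inc" regardless of strand, which matches the caption's statement that intronic overlapping lncRNAs can lie on either strand.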
Figure 4.
LncRNAs are shorter, less conserved, and expressed at lower levels than protein-coding genes. (A) Transcript length (a), number of exons (b), and maximum ORF length (ORFmax) (c) of the 1133 lncRNAs (top row) and of the 1133 lncRNAs (blue) in comparison to protein-coding transcripts (44,810 transcripts with PhyloCSF > 50; gray; bottom row). LncRNAs are generally shorter, have fewer exons, and contain shorter ORFs than protein-coding transcripts. Note that this might be an underestimation of the actual size of lncRNAs due to a potentially more incomplete assembly of low-expressed transcripts. (B) Comparison of the expression levels of lncRNA loci (859) and protein-coding loci (19,592 loci with PhyloCSF >50), plotted as fragments per kilobase of exon per million fragments mapped (FPKM). LncRNA loci are expressed at approximately 10-fold lower levels than the majority of protein-coding loci. (C) Comparison of the alignment quality across the locus of interest, assessed by two alternative measurements of the branch lengths present in the alignment. Branch lengths are measured on a scale from 0 to 1, where 0 indicates no alignments over the region of interest and 1 indicates the presence of 100% of sequence alignments. The branch length (BL) score refers to the alignment quality of the region that scores highest in PhyloCSF (the highest scoring region [HSR]; left). The complete branch length (CBL) score refers to the alignment quality over the entire length of the transcript (right). In the case of noncoding genes, alignments are poorer for the HSRs than for the entire gene length (BL scores < CBL scores). The reverse is true for protein-coding genes, which tend to have the best alignments over the HSRs (BL scores close to one). The values of the median (yellow dashed line) and mean are indicated in all panels.
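The FPKM unit used in panel B follows directly from its name: fragment counts are scaled by exon length in kilobases and by library size in millions of mapped fragments. The function below is a minimal arithmetic sketch of that definition; the function and parameter names are illustrative, and it does not reproduce the abundance estimation performed by the assembly tools.

```python
def fpkm(fragments: float, exon_length_bp: float, total_mapped_fragments: float) -> float:
    """Fragments per kilobase of exon per million fragments mapped."""
    if exon_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("exon length and library size must be positive")
    kilobases = exon_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions)


# Hypothetical numbers: 500 fragments on 2 kb of exon in a library of 10 million fragments.
example = fpkm(500, 2_000, 10_000_000)  # 25.0
```

On this scale, the approximately 10-fold difference noted in the caption corresponds to, for instance, a locus at 2.5 FPKM versus one at 25 FPKM.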
Figure 5.
LncRNA genes carry chromatin marks associated with developmental regulators. Shown are the fractions of promoters (±500 bp relative to the transcription start site [TSS]) that are marked by a specific histone modification at shield stage. Histone marks were assessed by ChIP-seq experiments and analyzed for the presence of H3K4me3 only, H3K27me3 only, and both H3K4me3 and H3K27me3. RefSeq genes (gray bars); protein-coding loci (black bars); lncRNA loci (blue bars). (A) Marked fractions of promoters considering all loci. (B) Marked fractions of promoters only considering loci expressed at shield stage. In B, protein-coding loci were sampled from expression levels comparable to the set of 145 lncRNA loci expressed at shield (see Methods). Error bars, 1 SD of 10,000-times sampling. (C) Example chromatin profiles for a shield-expressed lincRNA gene marked by H3K4me3 (top) and for a lncRNA locus (overlapping the protein-coding genes eng2a and insig1) marked by both H3K4me3 and H3K27me3 (bottom). Signals are shown as the number of ChIP-seq reads that aligned overlapping in a 5-bp window (note that the y-axis ranges from 0–12).
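The promoter definition and the three mark categories in this caption can be sketched as a window test against peak intervals. The code below is an illustration only: it assumes that ChIP-seq enrichment has already been summarized as lists of half-open peak intervals (the caption does not describe how peaks were called), and all function names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), an assumption of this sketch


def promoter_window(tss: int, flank: int = 500) -> Interval:
    """Promoter as TSS +/- 500 bp, clipped at the chromosome start."""
    return (max(0, tss - flank), tss + flank)


def overlaps_any(window: Interval, peaks: List[Interval]) -> bool:
    return any(s < window[1] and window[0] < e for s, e in peaks)


def mark_status(tss: int, k4me3_peaks: List[Interval], k27me3_peaks: List[Interval]) -> str:
    """Assign one of the categories plotted in panels A and B."""
    window = promoter_window(tss)
    k4 = overlaps_any(window, k4me3_peaks)
    k27 = overlaps_any(window, k27me3_peaks)
    if k4 and k27:
        return "both"
    if k4:
        return "H3K4me3 only"
    if k27:
        return "H3K27me3 only"
    return "unmarked"


def marked_fraction(statuses: List[str], category: str) -> float:
    """Fraction of promoters in a set that fall into one category."""
    return statuses.count(category) / len(statuses)
```

Applying `mark_status` to every locus in a set and then calling `marked_fraction` per category would give bar heights of the kind shown in panels A and B.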
Figure 6.
Temporal expression profiles of lncRNA genes compared to protein-coding genes. (A) Dynamic changes in expression profiles of loci (rows) across eight embryonic stages (columns). Heatmaps of 859 lncRNA loci (blue; left) and 23,462 protein-coding loci (gray; right) show normalized expression values (the sum of expression across all stages per locus is set to one). Three main expression patterns can be distinguished: “cleavage stages” (transcripts present in two- to four-cell-stage embryos), “zygotic” (transcripts enriched during blastula and gastrula stages and absent/only present at low levels at the two- to four-cell stage), and “larval” (transcripts induced only 1 d after fertilization). Note that the fraction of parentally provided (cleavage stage) transcripts is higher for lncRNAs than for protein-coding transcripts. (B) Temporal restriction of expression. Shown are distributions of Shannon entropy-based temporal specificity scores that were calculated for distinct classes of lncRNA loci and protein-coding loci (see Methods): exonic overlapping antisense lncRNAs (red), intronic overlapping lncRNAs (green), intergenic lncRNAs (blue), all protein-coding loci (black), and protein-coding loci of similar expression levels as lncRNA loci (gray; 95% confidence interval based on 10,000-times sampling). All classes of lncRNA loci display higher temporal specificity than protein-coding loci. (C) Expression-based association matrix of 835 lncRNA loci (rows) and functional gene sets (columns), derived from gene set enrichment analysis (GSEA). (Red) Positive correlation; (blue) negative correlation; (white) no correlation. Rows corresponding to lncRNAs whose RNA expression pattern is shown by in situ hybridization in Figure 7 are indicated on the left. Black boxes highlight two clusters associated with functions in signaling (cluster 2) and development (cluster 6). (Top right) The most enriched GO terms per cluster in comparison to all other clusters. 
(Bottom right) The 10 most enriched GO terms in the two boxed clusters in comparison to all other clusters, ranked by their –log10(P-values).
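Two computations in this caption lend themselves to a short illustration: the per-locus normalization used in the heatmaps (expression summed across stages is set to one) and an entropy-based temporal specificity score. The exact score is defined in the paper's Methods, which are not reproduced here, so the formula below is an assumption: one common choice that rescales Shannon entropy so that 0 means uniform expression across stages and 1 means expression in a single stage.

```python
import math
from typing import List


def normalize(expression: List[float]) -> List[float]:
    """Scale a locus's expression so that the values across stages sum to one."""
    total = sum(expression)
    if total <= 0:
        raise ValueError("locus has no expression in any stage")
    return [x / total for x in expression]


def temporal_specificity(expression: List[float]) -> float:
    """Assumed score: 1 - H / log2(number of stages), with H the Shannon entropy.

    This is an illustrative definition, not necessarily the one used in the paper.
    """
    if len(expression) < 2:
        raise ValueError("need at least two stages")
    p = normalize(expression)
    entropy = -sum(q * math.log2(q) for q in p if q > 0)
    return 1.0 - entropy / math.log2(len(p))


# Hypothetical FPKM values across the eight stages of the time series.
broad = [4, 5, 6, 5, 4, 5, 6, 5]
narrow = [0, 0, 9, 1, 0, 0, 0, 0]
```

With this definition, `temporal_specificity(narrow)` is higher than `temporal_specificity(broad)`, which is the direction of the difference the caption reports between lncRNA loci and protein-coding loci.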
Figure 7.
LncRNAs show tissue-specific and subcellularly restricted expression patterns. (A) Examples of lncRNAs with cell type–specific expression patterns at different stages of embryogenesis. Shown are in situ hybridization images with probes specific to the indicated lncRNAs. Expression is observed (i) in a two-cell stage embryo (cytoplasmic streaming from the yolk), (ii) in developing muscles, and (iii,iv) in distinct cells in the developing nervous system. (i,ii) Lateral views (anterior toward the left in ii); (iii,iv) dorsal views, anterior toward the left. (B) Examples of subcellularly localized lncRNAs. Bottom panels in i and ii (middle panel in iii, right) show a counterstain of the in situ image with the DNA-dye OliGreen (green). Black arrowheads point to subcellularly localized RNAs; white arrowheads point to the same position in the OliGreen-stained images. (i) Nuclear enrichment and association with chromatin (hoxAa-lncRNA); (top) 16-cell stage embryo with mitotically dividing nuclei; (middle, bottom) four-cell stage embryo. (ii) Enrichment at the nuclear periphery (mprip_lncRNA): (top) overview of a bud-stage embryo, showing accumulation of the lncRNA around nuclei of the yolk syncytial layer (YSL); (middle, bottom) close-up view of a dissected portion of the embryo shown in the top panel. Note that the lncRNA is specifically enriched around the large nuclei of the YSL but not around the small nuclei of the overlying cell-sheet. (iii) Enrichment at the myoseptum, the boundary between two adjacent myotubes (myo18a-lncRNA; top left, right); dystrophin mRNA (middle left) is a known marker of the myoseptum (Bassett 2003); myzh1.1 (myosin heavy chain) mRNA (bottom left) is detected throughout the somites (not subcellularly localized); and (right) myo18a-lncRNA (red, in situ) is enriched at the myoseptum, which is characterized by the absence of nuclei (regions of no green in the OliGreen-stained panel). 
Note that there is no overlap between red and green in the merge panel.
