Genome Res. 2017 Aug;27(8):1371-1383.
doi: 10.1101/gr.208652.116. Epub 2017 May 9.

The identification and functional annotation of RNA structures conserved in vertebrates

Stefan E Seemann et al. Genome Res. 2017 Aug.

Abstract

Structured elements of RNA molecules are essential in, e.g., RNA stabilization, localization, and protein interaction, and their conservation across species suggests a common functional role. We computationally screened vertebrate genomes for conserved RNA structures (CRSs), leveraging structure-based, rather than sequence-based, alignments. After careful correction for sequence identity and GC content, we predict ∼516,000 human genomic regions containing CRSs. We find that a substantial fraction of human-mouse CRS regions (1) colocalize consistently with binding sites of the same RNA binding proteins (RBPs) or (2) are transcribed in corresponding tissues. Additionally, a CaptureSeq experiment revealed expression of many of our CRS regions in human fetal brain, including 662 novel ones. For selected human and mouse candidate pairs, qRT-PCR and in vitro RNA structure probing supported both shared expression and shared structure despite low abundance and low sequence identity. About 30,000 CRS regions are located near coding or long noncoding RNA genes or within enhancers. Structured (CRS overlapping) enhancer RNAs and extended 3' ends have significantly increased expression levels over their nonstructured counterparts. Our findings of transcribed uncharacterized regulatory regions that contain CRSs support their RNA-mediated functionality.

Figures

Figure 1.
Performance assessment, genomic distribution, and conservation of CRS predictions. (A) Mean FDR of CRSs for different CMfinder score (pscore) cutoffs and GC-content intervals. FDR calculation is based on SISSIz (Gesell and Washietl 2008) simulated alignments. The large decrease in FDR observed between pscore cutoff 40 and 50 motivated us to base all further analyses on pscore ≥ 50. The mean FDR covering all ranges of GC content is 15.8%. (B) GC content of CRS region alignments. (C) Fold enrichment of CRS regions for biotypes and previous computational RNA structure screens in vertebrates (blue). (D) Absolute CRS region coverage of biotypes. (E) Relative position of CRS regions over noncoding biotypes presented as fold enrichment of CRS regions in bins, each 5% (considering only exons) of the feature's (UTR or gene) length. The trend of decreasing number of structures from 5′ to 3′ is common to 5′ UTRs and lncRNAs. (F) Number of CRSs conserved in the 100-species tree. (G) Average pairwise sequence identity (SI) of CRS region alignments over the 17 representative genomes in the phylogenetic tree. (H) Realignment (calculated as in Torarinsson et al. 2008) compares the 17-species MULTIZ alignment blocks (hg18) to corresponding structure-based alignments of CRS regions (17-way subalignments extracted from our 100-species/hg38 results, as described in Methods). (I) Species number of CRS region alignments. In B, G, and I, the CRSs of highest GC content, SI, and species number, respectively, are used as representatives of a CRS region, and in H the CRSs of lowest realignment are used as representatives. The biotypes in G, H, and I are sorted by their median SI.
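Panels B and G summarize each alignment by its GC content and its average pairwise sequence identity (SI). The following is a minimal Python sketch of how such summaries can be computed from a gapped alignment. The function names, the toy alignment, and the convention of skipping any column in which either sequence has a gap are assumptions made for illustration; the exact definitions used in the study are given in its Methods, not here.

```python
from itertools import combinations


def pairwise_identity(a, b):
    """Identity of two aligned rows, counted over columns where neither has a gap."""
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # assumption: gapped columns are ignored
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def average_pairwise_identity(rows):
    """Mean identity over all unordered pairs of alignment rows."""
    pairs = list(combinations(rows, 2))
    if not pairs:
        raise ValueError("need at least two aligned sequences")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


def gc_content(rows):
    """Fraction of G and C among all non-gap characters in the alignment."""
    bases = [c for row in rows for c in row.upper() if c != "-"]
    if not bases:
        raise ValueError("alignment contains no bases")
    return sum(c in "GC" for c in bases) / len(bases)


# Hypothetical three-species toy alignment (not data from the study).
alignment = ["GGCAU-CC", "GGCAUACC", "GACGU-UC"]
print(round(average_pairwise_identity(alignment), 3))
print(round(gc_content(alignment), 3))
```

Other conventions, such as counting gap columns as mismatches, would give lower identities for the same alignment, so the convention should be stated whenever SI values are compared across screens.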
Figure 2.
Human and mouse conservation of CRS regions is reflected by binding sites of RBPs and expression. (A) Seven of 10 RBPs display enrichment of CRSs in conserved binding sites (P < 10⁻⁷, FET). Significant enrichments are colored dark blue; light blue were not significant. (B) A relatively large number of CRSs (146,670) are expressed in both human and mouse (red bars) over four tissues (heart, liver, diencephalon/forebrain, and cerebellum/hindbrain) with comparable total RNA-seq data (Methods). In total, 157,136 CRSs are expressed in both human and mouse in total RNA-seq or poly(A) RNA-seq (Supplemental Fig. S7). CRSs with an empirical P-value < 0.01 were assigned an “expressed” state. We considered only 433,327 of 543,390 human–mouse conserved CRSs that have the same biotype in both species. Note that “5′ extension” and “3′ extension” refer to 2-kb regions upstream of and downstream from UTRs and lncRNAs; UTRs themselves are included in “mRNA.” (C) Expression correlation between human and mouse for different biotypes was measured by Pearson's correlation coefficient r of expression levels in poly(A) RNA-seq (six tissues: testis, liver, kidney, heart, cerebellum, and brain). “Background” is sampled over the input MA blocks with human–mouse conservation not overlapping the other biotypes. The number on the left of the violin plots is the total number of measured CRSs with expression in at least two tissues, and the number on the right is the number of CRSs with r > 0.8.
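Panel C rests on a per-CRS Pearson correlation between human and mouse expression profiles across matched tissues, followed by a count of CRSs with r > 0.8. The Python sketch below illustrates that calculation under stated assumptions: the CRS identifiers and expression values are invented for illustration, and the filtering on expression in at least two tissues is not reproduced.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric profiles."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a flat profile
    return cov / sqrt(var_x * var_y)


def count_correlated(profiles, threshold=0.8):
    """Count entries whose human and mouse profiles correlate above the threshold."""
    return sum(
        1 for human, mouse in profiles.values() if pearson_r(human, mouse) > threshold
    )


# Hypothetical expression levels in six tissues, in the same tissue order
# for both species (testis, liver, kidney, heart, cerebellum, brain).
profiles = {
    "crs_a": ([5.0, 1.0, 1.5, 0.5, 8.0, 9.0], [4.0, 0.8, 1.0, 0.7, 7.5, 9.5]),
    "crs_b": ([2.0, 6.0, 1.0, 0.0, 3.0, 1.0], [5.0, 0.5, 4.0, 3.0, 0.2, 6.0]),
}
print(count_correlated(profiles))
```

A flat profile yields an undefined correlation, which the sketch returns as NaN; since NaN never compares greater than the threshold, such entries are not counted.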
Figure 3.
CaptureSeq and qRT-PCR show conserved expression of CRSs. (A) ROC curve of CRS region detection in brain based on public poly(A) RNA-seq defined by different CPM/RLE cutoffs (numbers on the curves) using the CRS region detection through CaptureSeq in fetal brain as the gold standard. (B) Expression profiles of 23 CRS regions were measured with qRT-PCR (normalized by CRS regions) in seven tissues in both human and mouse. The CRS regions have weakly conserved primary sequences and were expressed in the CaptureSeq (P < 0.1). The CRS regions are sorted by decreasing Pearson's correlation coefficients of expression profiles between human and mouse. (C) The CRS region C3381920 is located in the 3′ end of the lncRNA AC07304.25. Despite no expression in brain in publicly available total and poly(A) RNA-seq data, it showed up in human brain in both CaptureSeq and qRT-PCR. Common expression in human and mouse was observed in the gastrointestinal tract (small intestine and colon; see B). Region C3381920 contains the CRS M0653745 whose structure is highly conserved between human and mouse. Color code in human and mouse structures is base-pair probabilities calculated by the Vienna RNA package (Lorenz et al. 2011).
Figure 4.
In vitro RNA structure probing in human and mouse shows conserved structure of CRS M1695693. FDR is 11.0% and SI of the nine-species (filtered from the 17-species tree) structural alignment is 48% (45% between human and mouse). The CRS is located between the 3′ UTRs of HOMER2 (minus strand) and WHAMM (plus strand; Chr 15: 8284671–82846804). It overlaps a DNase hypersensitive site (DHS) (ENCODE) and has the typical chromatin signatures of enhancers, namely, enrichment of H3K4me1 and reduced enrichment of H3K4me3, all indicators for a transcribed regulatory region. However, CAGE data from FANTOM5 did not support this hypothesis; instead, poly(A) site clusters (Gruber et al. 2016) suggest an extended 3′ UTR of HOMER2. (A) Genomic tracks. (B) Structure probing results in human and mouse, where red marks base-paired nucleotides (ds), and green and blue mark single-stranded nucleotides (ss). (C) CMfinder's structural alignment, predicted consensus RNA secondary structure, and predicted individual structures in human and mouse as dot-bracket notation. The probing results are overlapped with the in silico predictions by their color code.
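Panel C overlays probing results on structures written in dot-bracket notation, where matching parentheses mark base-paired positions and dots mark unpaired ones. As a rough illustration, the Python sketch below parses a dot-bracket string and reports the fraction of probed positions whose call agrees with the predicted pairing state. The single-letter encoding of probing calls (`d`, `s`, `?`) and the toy structure are assumptions made for this example, not the format used in the study.

```python
def parse_dot_bracket(structure):
    """Map each paired position to its partner for a dot-bracket string."""
    stack = []
    pairs = {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            j = stack.pop()
            pairs[i] = j
            pairs[j] = i
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return pairs


def probing_agreement(structure, probing):
    """Fraction of informative probing calls that match the predicted structure.

    probing uses a hypothetical encoding: 'd' for double-stranded,
    's' for single-stranded, and '?' for positions without data.
    """
    if len(structure) != len(probing):
        raise ValueError("structure and probing must have the same length")
    pairs = parse_dot_bracket(structure)
    informative = 0
    agree = 0
    for i, call in enumerate(probing):
        if call == "?":
            continue
        informative += 1
        if (call == "d") == (i in pairs):
            agree += 1
    return agree / informative if informative else 0.0


# Toy hairpin, not the structure of CRS M1695693.
print(parse_dot_bracket("((..))"))
print(probing_agreement("((..))", "sds?dd"))
```

The parser handles only nested pairs; pseudoknots, which some tools write with additional bracket types, are rejected as unexpected characters.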
Figure 5.
Coverage and expression of CRS regions in gene regulatory regions. The figure's three rows describe regions surrounding (1) enhancers (A–D), (2) most distal TSS of mRNAs/lncRNAs (E–H), and (3) most distal 3′ end of mRNAs/lncRNAs (I–K), respectively. (A,E,I) Plot density of CRS regions near those features: counts in 50-bp windows normalized by the number of features. “Predicted” curves (orange) reflect all CRS regions; “transcribed” curves (blue) reflect the subset supported by unannotated transcription boundaries. Lower subpanels show estimated FDRs (mean, SD) of those predictions. All other panels are based on the “transcribed” subset; for details, see Methods section “Definition of Gene Regulatory Regions” and Supplemental Figure S11. In summary, expression is based on the following: (B,C) CAGE TSS near enhancers, (F,G) CAGE TSS upstream anti-sense w.r.t. mRNA/lncRNA, and (J,K) active poly(A) sites downstream sense w.r.t. mRNA/lncRNA. “Structured”/“CRS” denote regions that overlap CRSs; “unstructured”/“no CRS” do not. (B,C,F,G) Total RNA-seq in fetal human cerebellum (technical replicate two of experiment ENCSR000AEW; ENCODE Phase 3). (J,K) Poly(A) RNA-seq of human brain (HBM). (B,F,J) Expression levels are in counts per million after cross-experiment relative log expression normalization (CPM/RLE). (C,G,K) GC content and phastCons (from 100-species MULTIZ alignments) of expressed structured (CRS) versus unstructured regions (no CRS). Expressed regions were defined by empirical P-value < 0.01 and CPM/RLE ≥ 1. (D,H) Transcript stability at ENCODE HeLa DHSs, as described in Andersson et al. (2014b), and GC content of structured (CRS) and unstructured regions (no CRS). Odds ratios quantify how strongly stability is associated with CRS overlap.
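Two quantities in this caption are simple enough to sketch in Python: counts per million (CPM) and the odds ratio of a 2×2 table relating transcript stability to CRS overlap. The sketch below is illustrative only. The read counts and table entries are invented, the function names are hypothetical, and the cross-experiment relative log expression (RLE) normalization step is deliberately left out.

```python
def counts_per_million(counts):
    """Scale raw read counts so that they sum to one million."""
    total = sum(counts)
    if total == 0:
        raise ValueError("library has no reads")
    return [c * 1_000_000 / total for c in counts]


def odds_ratio(stable_crs, unstable_crs, stable_no_crs, unstable_no_crs):
    """Odds ratio of stability given CRS overlap versus no CRS overlap.

    Values above 1 mean that stable transcripts are over-represented
    among regions that overlap a CRS.
    """
    if unstable_crs == 0 or stable_no_crs == 0:
        raise ZeroDivisionError("odds ratio is undefined for a zero off-diagonal cell")
    return (stable_crs * unstable_no_crs) / (unstable_crs * stable_no_crs)


# Hypothetical raw counts for three regions in one library.
print(counts_per_million([10, 40, 50]))

# Hypothetical 2x2 table: (stable, unstable) for CRS and for no-CRS regions.
print(odds_ratio(stable_crs=30, unstable_crs=10, stable_no_crs=20, unstable_no_crs=40))
```

An odds ratio on its own says nothing about significance; a test such as Fisher's exact test on the same 2×2 table would normally accompany it.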
Figure 6.
Example CRSs in gene regulatory regions are supported by unannotated transcript boundaries. (A) Two intergenic enhancers in a highly structured region of low SI (CRS M1293227 is only conserved in primates) between two gene 3′ ends. (B) MIR320A is upstream of POLR3D TSS. (C) Anti-sense transcription at the promoter of the lncRNA LINC01132 has enhancer-like chromatin signatures. (D) Intergenic enhancer with unidirectional stable transcription from the minus strand as measured by control and exosome-depleted HeLa cells (Andersson et al. 2014b). Color code in consensus structures is the level of base-pair conservation in the structure-based alignments.

References

    1. The 1000 Genomes Project Consortium. 2015. A global reference for human genetic variation. Nature 526: 68–74.
    2. Almada AE, Wu X, Kriz AJ, Burge CB, Sharp PA. 2013. Promoter directionality is controlled by U1 snRNP and polyadenylation signals. Nature 499: 360–363.
    3. Andersson R, Gebhard C, Miguel-Escalada I, Hoof I, Bornholdt J, Boyd M, Chen Y, Zhao X, Schmidl C, Suzuki T, et al. 2014a. An atlas of active enhancers across human cell types and tissues. Nature 507: 455–461.
    4. Andersson R, Refsing AP, Valen E, Core LJ, Bornholdt J, Boyd M, Heick JT, Sandelin A. 2014b. Nuclear stability and transcriptional directionality separate functionally distinct RNA species. Nat Commun 5: 5336.
    5. Arner E, Daub CO, Vitting-Seerup K, Andersson R, Lilje B, Drablos F, Lennartsson A, Ronnerblad M, Hrydziuszko O, Vitezic M, et al. 2015. Gene regulation: transcribed enhancers lead waves of coordinated transcription in transitioning mammalian cells. Science 347: 1010–1014.
