Genome Res, 17 (6), 852-64

Structured RNAs in the ENCODE Selected Regions of the Human Genome

Stefan Washietl et al. Genome Res.

Abstract

Functional RNA structures play an important role both in noncoding RNA transcripts and as regulatory elements in mRNAs. Here we present a computational study to detect functional RNA structures within the ENCODE regions of the human genome. Since structural RNAs in general lack characteristic signals in primary sequence, comparative approaches evaluating evolutionary conservation of structures are most promising. We have used three recently introduced programs based on either phylogenetic stochastic context-free grammars (EvoFold) or energy-directed folding (RNAz and AlifoldZ), yielding several thousand candidate structures (corresponding to approximately 2.7% of the ENCODE regions). EvoFold has its highest sensitivity in highly conserved and relatively AU-rich regions, while RNAz favors slightly GC-rich regions, resulting in a relatively small overlap between methods. Comparison with the GENCODE annotation points to functional RNAs in all genomic contexts, with a slightly increased density in 3'-UTRs. While we estimate a significant false discovery rate of approximately 50%-70%, many of the predictions can be further substantiated by additional criteria: 248 loci are predicted by both RNAz and EvoFold, and an additional 239 RNAz or EvoFold predictions are supported by the (more stringent) AlifoldZ algorithm. Five hundred seventy RNAz structure predictions fall into regions that also show signs of selection pressure on the sequence level (i.e., conserved elements). More than 700 predictions overlap with noncoding transcripts detected by oligonucleotide tiling arrays. One hundred seventy-five selected candidates were tested by RT-PCR in six tissues, and expression could be verified in 43 cases (24.6%).
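
The abstract reports how many loci are predicted by more than one method. As a rough illustration of that kind of bookkeeping, the sketch below counts predictions from one set that overlap at least one prediction from another. It is not the authors' pipeline: the function name `count_overlapping`, the half-open coordinate convention, the "any shared nucleotide" overlap rule, and all coordinates are assumptions made for this example.

```python
from typing import Dict, List, Tuple

# (chromosome, start, end) with half-open coordinates [start, end).
# This representation is an assumption for illustration only.
Interval = Tuple[str, int, int]


def count_overlapping(a: List[Interval], b: List[Interval]) -> int:
    """Count intervals in `a` that share at least one position with an interval in `b`."""
    # Group the second set by chromosome so only same-chromosome pairs are compared.
    by_chrom: Dict[str, List[Tuple[int, int]]] = {}
    for chrom, start, end in b:
        by_chrom.setdefault(chrom, []).append((start, end))

    hits = 0
    for chrom, start, end in a:
        # Two half-open intervals overlap when each starts before the other ends.
        if any(start < e and s < end for s, e in by_chrom.get(chrom, [])):
            hits += 1
    return hits


# Made-up coordinates, not taken from the paper.
rnaz_hits = [("chr7", 100, 220), ("chr7", 500, 620), ("chrX", 40, 160)]
evofold_hits = [("chr7", 200, 260), ("chrX", 300, 350)]
shared = count_overlapping(rnaz_hits, evofold_hits)
print(shared)
```

The quadratic scan is adequate for a few thousand predictions; a sorted sweep or an interval tree would be the usual choice for larger sets.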

Figures

Figure 1.
Score distribution of AlifoldZ, RNAz, and EvoFold computed for all input alignments. (A) Minimum free energies of the consensus structures as computed by RNAalifold. Note that more negative scores correspond to more stable/conserved consensus structures. (B) The significance of the consensus MFEs is estimated by AlifoldZ for all consensus structures with MFE < −15, resulting in normalized Z-scores. Here, too, negative values mean more stable and conserved structures. The two significance cutoffs used throughout this work are indicated. (C) RNAz classifies alignments using a support vector machine. The distribution of SVM decision variables is shown as well as the two significance cutoffs, which are expressed as “classification probabilities,” P. (D) Enlarged tail of C. (E) Raw EvoFold scores on the original input alignments. (F) EvoFold scores after extracting the predicted substructure, filtering weak structures (see Methods), and rescoring. The histogram shows all predictions, of which the top-scoring 50% were chosen as the high significance prediction set.
Figure 2.
Overlap of predictions from different methods (high significance level). The sets are drawn to scale for overlap in terms of nucleotides, and numbers indicate overlapping predictions. In addition, we give the total number of items outside the respective sets. (Left) All predictions; (right) predictions without coding exons and UTRs according to GENCODE annotation.
Figure 3.
Densities of EvoFold and RNAz predictions (high significance level) as a function of GC content and sequence conservation measured by the phastCons program (Siepel et al. 2005). While most RNAz predictions have elevated GC content and moderate sequence conservation, EvoFold is most sensitive at low GC contents and high sequence conservation.
Figure 4.
Overlap of predicted structured RNAs (high significance level) with the union of TARs/Transfrags and the “moderate” set of sequence-constrained elements. Hits in coding exons and UTRs are excluded.
Figure 5.
Genomic location of predicted RNAs (high significance level) relative to the GENCODE protein gene annotation. For comparison, the annotation of the input alignments is shown for both RNAz and EvoFold (they differ slightly because of the different filtering steps used for each program; see Methods). “Distal” and “Proximal” refer to a distance boundary of 5 kb away from the next gene (intergenic fraction) or coding exon (intronic fraction). Some hits fall within more than one annotation category, thus the sums of the fractions are slightly >100%.
Figure 6.
RT-PCR verification of ncRNA predictions. Positive controls include the known small ncRNAs listed in Table 3 as well as eight randomly chosen mRNAs of GENCODE protein-coding genes. Negative controls are randomly selected intergenic and intronic regions. Sets of RNAz and EvoFold predictions were manually selected both overlapping (T+) and not overlapping (T−) with TARs/transfrags. In addition, we selected a set of overlapping RNAz/EvoFold predictions (see Methods).
Figure 7.
Selected high-scoring examples. (Left) UCSC Genome Browser screenshots featuring conserved RNA predictions and additional ENCODE analysis tracks are shown. The significance levels of RNAz and EvoFold hits are color coded (see legend). (*) Significant AlifoldZ hits; the Z-score is shown. In addition, the results of the RACE/microarray experiments, TARs/Transfrags, constrained elements, phastCons scores, and GENCODE annotations are shown. For details on these tracks, refer to Methods. (Right) Consensus structure models generated by RNAalifold are shown for selected hits (marked by gray, dashed boxes; in example G, the first three hits and the sixth hit are shown). In the consensus structures, variable positions are circled, indicating compensatory and consistent mutations supporting the structure. The color indicates the number of different nucleotide combinations forming one base pair. Inconsistent mutations lead to pale colors. Examples A–C show predicted structures in intergenic regions. Examples D and E are located in introns of protein-coding regions. Examples F and G show structures associated with alternatively spliced transcripts of protein-coding loci detected by the GENCODE project. For further information, refer to the text.
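
The captions for Figures 1 and 7 refer to normalized Z-scores of consensus MFEs, where more negative values indicate more stable and conserved structures. The sketch below shows the generic Z-score calculation only. It assumes the background distribution comes from MFEs of randomized (for example, shuffled) control alignments; that assumption, the function name `z_score`, and every number are illustrative and do not describe the AlifoldZ implementation.

```python
from statistics import mean, stdev
from typing import Sequence


def z_score(observed_mfe: float, background_mfes: Sequence[float]) -> float:
    """Normalize an observed MFE against a background sample of MFEs.

    The background is assumed (for this sketch) to come from randomized
    control alignments; at least two background values are required.
    """
    mu = mean(background_mfes)
    sigma = stdev(background_mfes)  # sample standard deviation
    return (observed_mfe - mu) / sigma


# Invented values in kcal/mol, not data from the paper.
background = [-10.0, -12.0, -14.0, -8.0, -16.0]
z = z_score(-24.0, background)
print(f"Z = {z:.2f}")
```

An observed MFE well below the background mean gives a strongly negative Z, which is the direction the captions describe as more significant.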
