PLoS Comput Biol. 2017 Nov 6;13(11):e1005827.
doi: 10.1371/journal.pcbi.1005827. eCollection 2017 Nov.

Base Pair Probability Estimates Improve the Prediction Accuracy of RNA Non-Canonical Base Pairs

Free PMC article

Michael F Sloma et al. PLoS Comput Biol.

Abstract

Prediction of RNA tertiary structure from sequence is an important problem, but generating accurate structure models for even short sequences remains difficult. Predictions of RNA tertiary structure tend to be least accurate in loop regions, where non-canonical pairs are important for determining the details of structure. Non-canonical pairs can be predicted using a knowledge-based model of structure that scores nucleotide cyclic motifs, or NCMs. In this work, a partition function algorithm is introduced that allows the estimation of base pairing probabilities for both canonical and non-canonical interactions. Pairs that are predicted to be probable are more likely to be found in the true structure than pairs of lower probability. Pair probability estimates can be further improved by predicting the structure conserved across multiple homologous sequences using the TurboFold algorithm. These pairing probabilities, used in concert with prior knowledge of the canonical secondary structure, allow accurate inference of non-canonical pairs, an important step towards accurate prediction of the full tertiary structure. Software to predict non-canonical base pairs and pairing probabilities is now provided as part of the RNAstructure software package.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1
Fig 1. Prediction of extended secondary structure with CycleFold.
(A) A predicted structure for a Sarcin-Ricin loop sequence from Rattus norvegicus [50] using CycleFold with the MFE algorithm. Correctly predicted canonical pairs are drawn with heavy black lines, correctly predicted non-canonical pairs are light black lines, and the incorrectly predicted non-canonical pair is shown with a gray dashed line. The G-A pair at the base of the tetraloop is not present in the reference structure because the 3’ A is not stacked on the subsequent G, but is instead in contact with a protein, Restrictocin. (B) The probability dot plot calculated using CycleFold with the partition function algorithm. The upper right triangle shows pairs with estimated probabilities > 0.01, color-coded by pairing probability. The lower left triangle shows the pairs that are present in the reference structure. Each dot represents a single base pair, and nucleotide index (starting with 1 at the 5’ end) is shown along the x and y axes.
Fig 2
Fig 2. Benchmark of single-sequence prediction of canonical base pairs.
(A) Prediction accuracy of the lowest free energy structure, evaluated on canonical pairs. (B) Prediction with CycleFold, using structures composed of highly probable canonical pairs. Sensitivity and PPV are reported for structures with probability greater than a specified threshold (labeled on the plot). This demonstrates that the threshold stringency provides a tradeoff in terms of sensitivity and PPV.
Fig 3
Fig 3. Benchmark of single-sequence prediction of non-canonical base pairs.
(A) Prediction accuracy of the lowest free energy structure, evaluated on non-canonical pairs. This includes a calculation where CycleFold is constrained to include the known canonical base pairs to illustrate the performance of the NCM approach when canonical base pairs are known. (B) Prediction with CycleFold, using structures composed of highly probable non-canonical pairs. Sensitivity and PPV are reported for structures with probability greater than a specified threshold (labeled on the plot). This demonstrates that the threshold stringency provides a tradeoff in terms of sensitivity and PPV.
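The threshold tradeoff described in the captions of Figs 2 and 3 can be illustrated with a minimal sketch. It assumes pairs are scored by exact match against the reference structure, which may differ from the scoring rules used in the paper; the function name `sensitivity_ppv` and all probabilities below are hypothetical and are not data from the study.

```python
def sensitivity_ppv(pair_probs, reference_pairs, threshold):
    """Score the pairs whose estimated probability exceeds `threshold`.

    pair_probs: dict mapping (i, j) with i < j to an estimated probability.
    reference_pairs: set of (i, j) pairs in the reference structure.
    Returns (sensitivity, ppv); a value is None when its denominator is zero.
    Assumption: a predicted pair counts as correct only on an exact match.
    """
    predicted = {pair for pair, p in pair_probs.items() if p > threshold}
    true_positives = len(predicted & reference_pairs)
    sensitivity = true_positives / len(reference_pairs) if reference_pairs else None
    ppv = true_positives / len(predicted) if predicted else None
    return sensitivity, ppv


# Hypothetical probabilities, for illustration only.
probs = {(1, 20): 0.95, (2, 19): 0.90, (3, 18): 0.40, (5, 12): 0.30, (6, 11): 0.05}
reference = {(1, 20), (2, 19), (3, 18), (7, 10)}

for t in (0.02, 0.5, 0.9):
    print(t, sensitivity_ppv(probs, reference, t))
```

With these toy values, raising the threshold from 0.02 to 0.5 lowers sensitivity from 0.75 to 0.5 while PPV rises from 0.6 to 1.0, which is the kind of tradeoff the plots report.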
Fig 4
Fig 4
Prediction of canonical base pairs by predicting a conserved structure using multiple homologous sequences for (A) an MVE virus nuclease resistant RNA [53], (B) a D. radiodurans SRP hairpin domain [54], and (C) an O. sativa Twister ribozyme [55]. Prediction accuracy is shown for structures composed of highly probable pairs using information from a single sequence (blue) or a TurboFold calculation with 10 sequences (red). Also shown is prediction accuracy using evolutionary couplings from the plmc program [16] (green).
Fig 5
Fig 5
Prediction of non-canonical base pairs by predicting the conserved structure with multiple homologous sequences for (A) an MVE virus nuclease resistant RNA, (B) a D. radiodurans SRP hairpin domain, and (C) an O. sativa Twister ribozyme. Prediction accuracy is shown for structures composed of highly probable pairs using information from a single sequence (blue) or a TurboFold calculation on 10 sequences (red). In panels A and B, no blue line is present because the single sequence prediction did not correctly predict any pairs. Also shown is prediction accuracy using evolutionary couplings from the plmc program [16] (green).
Fig 6
Fig 6. An example of a pseudo-energy calculation using the NCM model.
ΔGjunction is evaluated for each pair of NCMs that share an edge, i.e. the ones that have an overlapping base pair. This term depends on the identities of the two NCMs (that is, the length of the 5’ and 3’ cycles for a double-stranded NCM, or the total length for a single-stranded NCM), and the nucleotides in the common base pair. This term is evaluated for the junction of NCM a with NCM b, NCM b with NCM c, and NCM c with NCM d.
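The junction sum described in this caption can be sketched as a table lookup over adjacent NCMs. This is only an illustration of the ΔGjunction term, not the full pseudo-energy; the `NCM` type, the `junction_sum` function, the table layout, and every numeric value are hypothetical and are not taken from the paper or from RNAstructure.

```python
from typing import NamedTuple, Tuple


class NCM(NamedTuple):
    kind: str                 # "ds" (double-stranded) or "ss" (single-stranded)
    lengths: Tuple[int, ...]  # (5' length, 3' length) for "ds"; (total length,) for "ss"


def junction_sum(ncms, shared_pairs, junction_table):
    """Sum a junction term over consecutive NCMs that share a base pair.

    ncms: NCMs in order along the structure (a, b, c, d, ...).
    shared_pairs: the common base pair for each adjacent NCM pair, e.g. "GC".
    junction_table: dict keyed by (first NCM, second NCM, shared pair).
    """
    if len(shared_pairs) != len(ncms) - 1:
        raise ValueError("need exactly one shared pair per adjacent NCM pair")
    total = 0.0
    for first, second, pair in zip(ncms, ncms[1:], shared_pairs):
        total += junction_table[(first, second, pair)]
    return total


# Four NCMs a-d, giving three junctions: a/b, b/c, and c/d.
a = NCM("ds", (2, 2))
b = NCM("ds", (3, 2))
c = NCM("ds", (2, 2))
d = NCM("ss", (4,))
shared = ["GC", "CG", "GC"]

# Hypothetical junction values, for illustration only.
table = {
    (a, b, "GC"): -1.0,
    (b, c, "CG"): -0.5,
    (c, d, "GC"): 0.25,
}

print(junction_sum([a, b, c, d], shared, table))  # prints -1.25
```

Keying the table on both NCM identities plus the shared pair mirrors the dependence stated in the caption: the term changes with the cycle lengths of either NCM or with the nucleotides in the common base pair.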
Fig 7
Fig 7. A recursion diagram [41] illustrating the NCM partition function algorithm.
Filled regions indicate terms that are being added to the partition function, and empty regions indicate results that were previously calculated. Solid lines indicate nucleotides that must be paired, while dotted lines indicate nucleotides that may or may not be paired.
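The idea of adding new terms to previously calculated results can be sketched with a much simpler model than the NCM recursion. The code below swaps in a McCaskill-style partition function in which every canonical pair contributes one fixed Boltzmann weight and pairs may not cross; it is not the algorithm from the paper. The value of `RT`, the `pair_energy` default, and the `min_loop` default are assumptions chosen for illustration.

```python
import math

RT = 0.616  # kcal/mol, approximately R*T near 37 °C (assumed value)
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def pair_weight(a, b, pair_energy=-1.0):
    """Boltzmann weight of one pair; pair_energy is a hypothetical toy value."""
    return math.exp(-pair_energy / RT) if (a, b) in CANONICAL else 0.0


def partition_function(seq, min_loop=3):
    """Partition function over non-crossing structures of canonical pairs.

    Z[i][j] covers seq[i..j] inclusive. Each entry is built only from
    entries for shorter intervals, which were filled in earlier.
    """
    n = len(seq)
    Z = [[1.0] * n for _ in range(n)]

    def z(i, j):
        # An empty interval has exactly one structure (no pairs).
        return Z[i][j] if i <= j else 1.0

    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            total = z(i, j - 1)  # case 1: nucleotide j is unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with k
                w = pair_weight(seq[k], seq[j])
                if w:
                    total += z(i, k - 1) * w * z(k + 1, j - 1)
            Z[i][j] = total
    return Z[0][n - 1] if n else 1.0


print(partition_function("GGAAACC"))
```

For "GGAAACC" with the defaults, the allowed structures are the empty structure, four single-pair structures, and one structure with two nested pairs, so the result equals 1 + 4w + w², where w is the weight of one pair.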

References

    1. Cech TR, Steitz JA (2014) The noncoding RNA revolution-trashing old rules to forge new ones. Cell 157: 77–94. doi: 10.1016/j.cell.2014.03.008
    2. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, et al. (2000) The Protein Data Bank. Nucleic Acids Res 28: 235–242.
    3. Leontis N, Zirbel CL (2012) Nonredundant 3D Structure Datasets for RNA Knowledge Extraction and Benchmarking. In: Leontis N, Westhof E, editors. RNA 3D Structure Analysis and Prediction: Springer Berlin Heidelberg.
    4. Derrien T, Johnson R, Bussotti G, Tanzer A, Djebali S, et al. (2012) The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression. Genome Res 22: 1775–1789. doi: 10.1101/gr.132159.111
    5. Dill KA, MacCallum JL (2012) The protein-folding problem, 50 years on. Science 338: 1042–1046. doi: 10.1126/science.1219021
