NAR Genom Bioinform. 2020 Nov 16;2(4):lqaa090.
doi: 10.1093/nargab/lqaa090. eCollection 2020 Dec.

Machine learning a model for RNA structure prediction

Nicola Calonaci et al. NAR Genom Bioinform.

Abstract

RNA function crucially depends on its structure. Thermodynamic models currently used for secondary structure prediction rely on computing the partition function of folding ensembles, and can thus estimate minimum free-energy structures and ensemble populations. These models sometimes fail to identify native structures unless complemented by auxiliary experimental data. Here, we build a set of models that combine thermodynamic parameters, chemical probing data (DMS and SHAPE) and co-evolutionary data (direct coupling analysis, DCA) through a network that outputs perturbations to the ensemble free energy. Perturbations are trained to increase the ensemble populations of a representative set of known native RNA structures. In the chemical probing nodes of the network, a convolutional window combines neighboring reactivities, highlighting their structural information content and the contribution of local conformational ensembles. Regularization is used to limit overfitting and improve transferability. The most transferable model is selected through a cross-validation strategy that estimates the performance of models on systems on which they are not trained. With the selected model we obtain increased ensemble populations for native structures and more accurate predictions in an independent validation set. The flexibility of the approach allows the model to be easily retrained and adapted to incorporate arbitrary experimental information.

Figures

Figure 1.
Graphical scheme of the machine learning procedure. (A) Models that integrate RNAfold, chemical probing experiments and DCA scores into the prediction of structure populations are trained. One of the proposed models is selected based on a transferability criterion and validated against data not seen during training. Available reference structures are used as targets for training and validation. (B) Sequence, reactivity profile and DCA data are included through additional terms in the RNAfold model free energy. The network is split into two channels: a single-layered channel for reactivity input (left side) and a double-layered channel for DCA couplings (right side). Along the reactivity channel, a convolutional layer applies a linear combination over a sliding window that includes the reactivity R_i of a nucleotide and the reactivities {R_{i+k}} of its neighbors, with weights {a_k} and bias b. The output is a pairing penalty λ_i for the i-th nucleotide. In the DCA channel, the first layer transforms the input DCA coupling J_ij via a non-linear (sigmoid) activation function, with weight A and bias B. The transformed DCA input is then mapped to a pairing penalty λ_ij for the specific ij pair via a second layer, implementing a linear activation function with weight C and bias D. Penalties for both individual nucleotides and specific pairs are applied as perturbations to the RNAfold free-energy model.
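The two channels described in panel (B) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the sigmoid is taken to be the logistic function, and reactivities outside the sequence are zero-padded (the caption does not say how sequence ends are handled).

```python
import numpy as np

def reactivity_penalties(R, a, b):
    """Per-nucleotide penalties: lambda_i = b + sum_k a_k * R_{i+k}, k = -p..p.

    `a` holds the 2p+1 window weights, ordered from k = -p to k = +p.
    Zero padding at the sequence ends is an assumption of this sketch.
    """
    R = np.asarray(R, dtype=float)
    a = np.asarray(a, dtype=float)
    if len(a) % 2 == 0:
        raise ValueError("window must have an odd number of weights (2p + 1)")
    p = (len(a) - 1) // 2
    padded = np.pad(R, p)  # zeros on both sides (assumed boundary handling)
    return np.array([b + np.dot(a, padded[i:i + 2 * p + 1])
                     for i in range(len(R))])

def dca_penalties(J, A, B, C, D):
    """Per-pair penalties: lambda_ij = C * sigmoid(A * J_ij + B) + D.

    The logistic form of the sigmoid is an assumption of this sketch.
    """
    J = np.asarray(J, dtype=float)
    hidden = 1.0 / (1.0 + np.exp(-(A * J + B)))  # first layer, non-linear
    return C * hidden + D                        # second layer, linear
```

With a single weight (p = 0) the reactivity channel reduces to a linear map of each nucleotide's own reactivity, for example `reactivity_penalties([1, 2, 3], [2.0], 0.5)`.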
Figure 2.
Population of native structures as a function of hyperparameters. Population is indicated by the color scale. The optimized population of native structures, when averaged over the training set (A), is by construction a monotonically increasing function of the integer p controlling the window size of the convolutional network in the reactivity channel, and a monotonically decreasing function of the regularization coefficients α_S and α_D. When averaged over the leave-one-out iterations of the cross-validation (CV) procedure (B), the dependence of the optimized population of native structures on these hyperparameters becomes non-trivial, as it results from a combination of model complexity (controlled by p) and regularization (controlled by α_S and α_D independently). The CV procedure serves as the criterion for model selection, resulting in the selection of hyperparameters {p = 0, α_S = 0.001, α_D = 0.001}.
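The leave-one-out selection described in this caption can be sketched generically as follows. This is only an outline under assumptions: `train` and `population` are hypothetical callables standing in for the actual training and scoring steps, which are not specified here, and the score is taken to be the mean held-out population of the native structure.

```python
def leave_one_out_select(molecules, grid, train, population):
    """Pick the hyperparameters with the highest mean held-out population.

    molecules  : list of training systems
    grid       : iterable of hyperparameter settings, e.g. (p, alpha_S, alpha_D)
    train      : hypothetical callable (systems, hyper) -> model
    population : hypothetical callable (model, system) -> native population
    """
    best, best_score = None, float("-inf")
    for hyper in grid:
        held_out = []
        for i, left_out in enumerate(molecules):
            # Train on every molecule except the one left out.
            rest = molecules[:i] + molecules[i + 1:]
            model = train(rest, hyper)
            # Score the model on the molecule it has not seen.
            held_out.append(population(model, left_out))
        score = sum(held_out) / len(held_out)
        if score > best_score:
            best, best_score = hyper, score
    return best, best_score
```

A grid over the three hyperparameters could be built with `itertools.product` over candidate values of p, α_S and α_D; the candidate values themselves would have to come from the main text.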
Figure 3.
Comparison of results obtained with unmodified RNAfold and with the selected models. Populations of native structures with (A) the best performing model; (B) the best performing model with DCA data only; (C) the best performing model with chemical probing data only. (D) Matthews correlation coefficients (MCC) between predicted MFE structures and reference native structures, as obtained with the selected (best, DCA-only, chemical probing-only) models and with unmodified RNAfold. Hyperparameters are noted in the figure. Native structure populations obtained with unmodified RNAfold (black cross), with our trained model (magenta star on the training set, red star on the validation set) and in the leave-one-out procedure (blue circle; for each molecule, the model is trained on all the other molecules in the training set) are reported. Populations obtained by mapping SHAPE reactivities into penalties with the method in Ref. (15) are reported for comparison (green plus), only for molecules studied in previous work and only in panel (C), where chemical probing data alone are used. The populations of native structures obtained with the trained model are almost always increased for molecules in both the training set (left side of the vertical line) and the validation set (right side). Overfitting occurs in a few cases, where the populations are lower than those obtained with unmodified RNAfold.
Figure 4.
MFE structure predictions. For each system in the validation set, the reference native structure is compared with the predicted MFE structures. For panel descriptions, see the main text. Correctly predicted base pairs (true positives) and unpaired nucleotides (true negatives) are shown in dark green and lime green, respectively. Wrongly predicted base pairs (false positives) and unpaired nucleotides (false negatives) are shown in orange and red, respectively. The MCC between prediction and reference is reported in parentheses. All the relevant improvements in the prediction of these structures are reported in detail in the ‘Results’ section. All secondary structure diagrams are drawn with forna (48).
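The four classes in this caption and the resulting MCC can be illustrated with a short Python sketch on dot-bracket strings. The per-nucleotide counting convention below is an assumption made for illustration and may differ from the definition used in the main text: a nucleotide is a true positive only if it is paired to the same partner in both structures, and the MCC is set to 0 when its denominator vanishes.

```python
import math

def pair_table(dot_bracket):
    """Return partner[i] = index paired with i, or None if i is unpaired."""
    partner = [None] * len(dot_bracket)
    stack = []
    for i, char in enumerate(dot_bracket):
        if char == "(":
            stack.append(i)
        elif char == ")":
            j = stack.pop()
            partner[i], partner[j] = j, i
    return partner

def mcc(predicted, reference):
    """Matthews correlation coefficient between two dot-bracket structures."""
    tp = fp = tn = fn = 0
    for p, r in zip(pair_table(predicted), pair_table(reference)):
        if p is not None:
            if p == r:
                tp += 1  # paired to the same partner as in the reference
            else:
                fp += 1  # predicted pair absent from the reference
        elif r is None:
            tn += 1      # unpaired in both structures
        else:
            fn += 1      # predicted unpaired, but paired in the reference
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Returning 0 for a vanishing denominator is a convention assumed here.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

For example, `mcc("(....)", "((..))")` counts two true positives, two true negatives and two false negatives, giving 0.5.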
Figure 5.
Properties of the optimized neural network. For the DCA channel, the optimized function mapping DCA couplings J_ij into pairing penalties λ_ij is shown for both (A) the selected model and (B) the best performing model restricted to DCA input only. When trained on the whole training set (red), the activation function is consistent with the average over the leave-one-out training subsets (orange). Error bars are computed as standard deviations and are significantly lower in the region of DCA couplings around zero, as couplings lying in that region are more frequent. The trained function maps high (respectively, low) DCA coupling values to penalties favoring (respectively, disfavoring) the corresponding pairings, thus affecting the population of the structures that include the specific pair. For (B) models including only DCA input, the threshold value of the coupling J_threshold between disfavored and favored pairing corresponds to the zero of the activation function, as indicated by the dashed line. For the chemical mapping channel, (C) optimal values of model parameters are shown for the selected model (black) with hyperparameters {α_S = 0.001, α_D = 0.001, p = 0}, and for the sub-optimal models with p > 0. All the training results (crosses) lie within the leave-one-out errors (dots with error bars), indicating robustness of the minimization procedure against cross-validation. Coefficients {a_{-k}, …, a_{+k}}, k > 0, which weight reactivities up to the k-th nearest neighbors of a nucleotide, report the minor contributions of the local reactivity pattern in addition to the nucleotide’s own reactivity.
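As a worked illustration of the threshold in panel (B), the zero of the DCA activation function can be written in closed form. This assumes the two-layer form given in the Figure 1 caption, with the sigmoid taken to be the logistic function (the specific functional form is an assumption here); a zero exists only when 0 < -D/C < 1.

```latex
\begin{align*}
\lambda_{ij} &= C\,\sigma(A J_{ij} + B) + D,
  \qquad \sigma(x) = \frac{1}{1 + e^{-x}} \\
\lambda_{ij} = 0 \;&\Longleftrightarrow\;
  \sigma(A J_{\mathrm{threshold}} + B) = -\frac{D}{C} \\
J_{\mathrm{threshold}} &= \frac{1}{A}\left[\ln\!\left(\frac{-D}{C + D}\right) - B\right]
\end{align*}
```

The last line follows from inverting the logistic function, σ(x) = s implies x = ln(s / (1 - s)), with s = -D/C.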

References

    1. Cech T.R. The ribosome is a ribozyme. Science. 2000; 289:878–879.
    2. Doudna J., Cech T. The chemical repertoire of natural ribozymes. Nature. 2002; 418:222–228.
    3. Morris K.V., Mattick J.S. The rise of regulatory RNA. Nat. Rev. Genet. 2014; 15:423–437.
    4. Wan Y., Kertesz M., Spitale R.C., Segal E., Chang H.Y. Understanding the transcriptome through RNA structure. Nat. Rev. Genet. 2011; 12:641–655.
    5. Cooper T.A., Wan L., Dreyfuss G. RNA and disease. Cell. 2009; 136:777–793.