Avoidance of Stochastic RNA Interactions Can Be Harnessed to Control Protein Expression Levels in Bacteria and Archaea
Free PMC article
A critical assumption of gene expression analysis is that mRNA abundances broadly correlate with protein abundance, but these two are often imperfectly correlated. Some of the discrepancy can be accounted for by two important mRNA features: codon usage and mRNA secondary structure. We present a new global factor, called mRNA:ncRNA avoidance, and provide evidence that avoidance increases translational efficiency. We also demonstrate a strong selection for the avoidance of stochastic mRNA:ncRNA interactions across prokaryotes, and that these have a greater impact on protein abundance than mRNA structure or codon usage. By generating synonymously variant green fluorescent protein (GFP) mRNAs with different potential for mRNA:ncRNA interactions, we demonstrate that GFP levels correlate well with interaction avoidance. Therefore, taking stochastic mRNA:ncRNA interactions into account enables precise modulation of protein abundance.
Archaea; E. coli; bacteria; bioinformatics; computational biology; evolutionary biology; gene expression; genomics; ncRNA; systems biology.
Conflict of interest statement
The authors declare that no competing interests exist.
Figure 1. mRNA:ncRNA avoidance is a conserved feature of bacteria and archaea.
(A) Native core mRNA:ncRNA binding energies (green line; mean = −3.21 kcal/mol) are significantly higher than all mRNA negative control binding energies (dashed lines; mean binding energies are −3.62, −5.21, −4.13, −3.86 and −3.92 kcal/mol respectively) in pairwise comparisons (p<2.2 × 10⁻¹⁶ for all pairs, one-tailed Mann-Whitney U test) for Streptococcus suis RNAs. (B) The difference between the density distributions of native mRNA:ncRNA binding energies and dinucleotide-preserved shuffled mRNA:ncRNA controls as a function of binding energy for different taxonomic phyla. Each coloured curve illustrates the degree of extrinsic avoidance for a different bacterial phylum or the archaea. Positive differences indicate an excess of native binding for that energy value; negative differences indicate an excess of interactions in the shuffled controls. The dashed black line shows the expected result if no difference exists between these distributions, and the dashed grey lines show empirical differences for shuffled vs shuffled densities from 100 randomly selected bacterial strains. (C) This box and whisker plot shows −log₁₀(P) distributions for each phylum and the archaea; the p-values are derived from a one-tailed Mann-Whitney U test for each genome of native mRNA:ncRNA versus shuffled mRNA:ncRNA binding energies. The black dashed line indicates the significance threshold (p<0.05). (D) A high intrinsic avoidance strain (Thermodesulfobacterium sp. OPB45) shows a clear separation between the G+C distributions of mRNAs and ncRNAs (p=9.2 × 10⁻²⁵, two-tailed Mann-Whitney U test), and a low intrinsic avoidance strain (Mycobacterium sp. JDM601) has no G+C difference between mRNAs and ncRNAs (p=0.54, two-tailed Mann-Whitney U test). (E) The x-axis shows −log₁₀(P) for our test of extrinsic avoidance using binding energy estimates for both native and shuffled controls, while the y-axis shows −log₁₀(P) for our intrinsic test of avoidance based upon the difference in G+C contents of ncRNAs and mRNAs. Two perpendicular dashed black lines show the threshold of significance for both avoidance metrics. 97% of bacteria and archaea are significant for at least one of these tests of avoidance.
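The extrinsic avoidance test described in this caption compares native and shuffled binding energies with a one-tailed Mann-Whitney U test. The following is a minimal sketch of such a test using only the Python standard library. It is not the authors' pipeline: the function name, the normal approximation (without tie or continuity correction), and the example energies are all assumptions made for illustration.

```python
import math

def mann_whitney_u_greater(x, y):
    """One-tailed Mann-Whitney U test: do values in x tend to exceed values in y?

    Returns (U, p). The p-value uses a normal approximation with no tie or
    continuity correction, which is an illustrative simplification.
    """
    n1, n2 = len(x), len(y)
    # U counts the (x, y) pairs in which x is larger; ties count as half.
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    # Upper-tail probability of the standard normal distribution.
    p = 0.5 * math.erfc(z / math.sqrt(2))
    return u, p

# Made-up binding energies in kcal/mol (higher = weaker interaction).
native = [-3.0, -2.5, -3.1]
shuffled = [-5.0, -4.2, -3.9]
u_stat, p_value = mann_whitney_u_greater(native, shuffled)
```

With real data, each genome would contribute many more energies than this toy example, and an established statistics package would be the safer choice for the p-value.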
Figure 1—figure supplement 1. Applying different energy models of intramolecular and intermolecular interactions for native sequences and various negative controls.
(A) The distributions of internal secondary structure (intramolecular) minimum free energies (MFEs) for 5´ ends of mRNA sequences, estimated using RNAfold from the Vienna package (Lorenz et al., 2011). (B) The distributions of hybridization MFEs between core mRNAs and ncRNAs, estimated using the RNAduplex algorithm from the Vienna package (Lorenz et al., 2011). (C) The distributions of binding MFEs between core mRNAs and ncRNAs, estimated using the RNAup algorithm (Lorenz et al., 2011). The RNAup algorithm minimizes the sum of the energies necessary to open binding sites on two RNA molecules and the hybridization energy (Lorenz et al., 2011). This method has been shown to be the most accurate general approach for sequence-based RNA interaction prediction (Pain et al., 2015).
Figure 1—figure supplement 2. The top and the bottom panels show bacterial phyla and archaeal phyla respectively.
Numbers in brackets show the total number of members, and the x-axis displays the percentage of extrinsic avoidance conservation in the associated phylum. The archaeal and bacterial phyla with fewer than 20 publicly available sequenced genomes were excluded from further analysis due to concerns about sample size sufficiency.
Figure 2. mRNA attributes have different impacts on protein abundance.
(A) This heatmap summarizes the effect sizes of four mRNA attributes (avoidance of mRNA:ncRNA interaction, 5´ end secondary structure, codon bias and mRNA abundance) on protein expression as Spearman’s correlation coefficients, shown as a color gradient; a starred block marks a significant correlation (p<0.05). (B) GFP expression correlates with optimized codon selection, measured by CAI (Rs = 0.29, p=0.016). (C) GFP expression correlates with 5´ end secondary structure of mRNAs, measured by 5´ end intramolecular folding energy (Rs = 0.34, p=0.006). (D) GFP expression correlates with avoidance, measured by mRNA:ncRNA binding energy (Rs = 0.56, p=6.9 × 10⁻⁶). (E) Each cartoon illustrates the corresponding hypothesis: (1) optimal codon distribution (corresponding tRNAs are available for translation), (2) low 5´ end RNA structure (high folding energy of the 5´ end) and (3) avoidance (fewer crosstalk interactions) lead to faster translation.
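The effect sizes in this figure are reported as Spearman's correlation coefficients, which are Pearson correlations computed on ranks. The sketch below shows that calculation with the standard library only. The helper names and the toy values are hypothetical and are not taken from the study's data.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)

# Toy example: hypothetical binding energies and fluorescence values.
binding_energy = [-6.1, -4.8, -3.9, -2.7]
fluorescence = [120.0, 340.0, 560.0, 910.0]
rs = spearman(binding_energy, fluorescence)
```

Because only ranks matter, the coefficient is unaffected by monotonic transformations such as taking the logarithm of the fluorescence values.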
Figure 2—figure supplement 1. GFP mRNA constructs have an unbiased design that produces different protein expressions.
In the left panel, an unrooted maximum likelihood tree of the extreme GFP mRNAs illustrates the low similarity between our GFP mRNA constructs. The distances were calculated using the HKY85 nucleotide substitution model. In the right panel, the y-axis shows relative fluorescence units (RFU) of GFP expression from synonymously sampled mRNAs with different characteristics, which are labelled in the figure legend. Optimal and high avoidance GFP mRNAs produce the highest expression, while low avoidance GFP mRNAs have the lowest expression (p=1.35 × 10⁻⁵, Kruskal-Wallis test).
Figure 2—figure supplement 2. The scatter-plots of protein abundances (as log-fluorescences) summarize the effect of general factors for extreme GFP and previously published GFP datasets.
(A–C) Each GFP mRNA was sampled from the extremes of one of three metrics presumed to impact expression: mRNA:ncRNA binding, 5´ end secondary structure or codon usage. Slightly darker or lighter colors display the type of extreme. Avoidance correlates with GFP expression (Rs = 0.56, p=6.9 × 10⁻⁶) more than CAI (Rs = 0.29, p=0.01) and 5´ end folding energy (Rs = 0.34, p=0.006). (D–F) In a previously published GFP dataset (Kudla et al., 2009), the CAI does not correlate with protein abundance (Rs = 0.02, p=0.4), while 5´ end folding energy (Rs = 0.61, p=5.7 × 10⁻¹⁸) and avoidance (Rs = 0.65, p=1.6 × 10⁻²⁰) influence GFP expression.
Figure 2—figure supplement 3. In the lower four panels we show the R² values for linear regression models between measures of each of avoidance, internal secondary structure, codon usage and mRNA levels for each of seven independent protein and mRNA expression datasets (Supplementary file 5).
We have also computed R² values for multiple linear regression models of the sum of the four measures (right) and the sum less the avoidance measure (right).
Figure 2—figure supplement 4. An outlier analysis of E. coli protein-per-mRNA ratios and avoidance, codon usage and internal mRNA secondary structure statistics.
(A) This plot shows the distribution of protein-per-mRNA ratios of native E. coli genes (n = 389) (Laurent et al., 2010). We selected the ten most and the ten least productive genes, which lie at the extreme ends of the plot (purple and green bars). (B) The y-axis shows the z-transformed scores of native mRNAs: CAIs, folding energies and binding energies. The expected background distribution (the white null bar in the middle) has a mean of 0 and standard deviation of 1, while a starred block shows whether the associated z-scores are significantly higher (or lower) than this background (p<0.05). This indicates that RNA avoidance is the only one of these factors that explains the difference in protein-per-mRNA ratio between the most and the least efficient native E. coli mRNAs.
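The z-transformation mentioned in panel B rescales each statistic against a background so that the background has mean 0 and standard deviation 1. A minimal sketch is given below. How the background was defined in the study is not stated in this caption, so the choice of background values and the use of the sample standard deviation here are assumptions.

```python
import statistics

def z_scores(values, background):
    """Express each value as standard deviations from the background mean.

    Uses the sample standard deviation of the background; this choice is an
    assumption made for illustration.
    """
    mu = statistics.mean(background)
    sd = statistics.stdev(background)
    return [(v - mu) / sd for v in values]

# Hypothetical binding energies for two genes against a toy background.
background_energies = [1.0, 2.0, 3.0]
selected_energies = [4.0, 2.0]
scores = z_scores(selected_energies, background_energies)
```

Putting CAI, folding energy and binding energy on this common scale is what allows the three measures to share one y-axis.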
Figure 2—figure supplement 5. Overview of mRNA:ncRNA avoidance analysis and results.
Our tests for avoidance can be divided into three main parts: (1) evolutionary conservation analyses to detect energy shifts in bacterial and archaeal genomes relative to dinucleotide shuffled negative controls, (2) analyses of proteomics, transcriptomics and GFP transformation data to predict the effect size of avoidance on protein expression and (3) the application of the avoidance hypothesis to design synonymous mRNAs that produce either high or low levels of the corresponding protein.
Figure 3. The most under-represented mRNA:rRNA interactions correspond to exterior regions of the ribosome.
(A) In the upper bar, the regions of the T. thermophilus SSU rRNA that are under-represented in stable interactions with mRNAs (p<0.05) are highlighted in red. The lower bar marks the inaccessible residues (<3.4 Angstroms from other nucleotides or amino acids in the PDB structure 4WZO). (B) The three-dimensional structure of the T. thermophilus ribosome includes 5S, SSU and LSU rRNA, 48 ribosomal proteins, 4 tRNAs and a bound mRNA (PDB ID: 4WZO) (Rozov et al., 2015). We have highlighted the most avoided regions of the SSU rRNA in red (based upon the fewest stable interactions with T. thermophilus mRNAs, p<0.05). Two different orientations are shown on the left and right; the upper structure shows just the SSU rRNA and mRNA structures, and the lower includes the ribosomal proteins (coloured blue). A view of the ribosome that also includes the LSU rRNA (green) is shown at the bottom right. There is a significant correspondence between the accessibility of a region of SSU rRNA and the degree to which it is avoided (p=2.5 × 10⁻¹⁷, Fisher’s exact test).
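The correspondence between accessibility and avoidance is assessed with Fisher's exact test, which operates on a 2 × 2 contingency table (for example, accessible vs inaccessible residues against avoided vs not avoided residues). The sketch below computes a two-sided p-value from the hypergeometric distribution. The caption does not say whether the test was one- or two-sided, and the counts used here are invented, so both are assumptions.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are no
    more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(k):
        # Hypergeometric probability of k counts in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    lo, hi = max(0, col1 - row2), min(col1, row1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in (prob(k) for k in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

# Invented counts: rows = accessible / inaccessible, columns = avoided / not.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
```

For tables with thousands of residues the exact integers from `math.comb` stay precise, although a dedicated statistics library would be faster.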
Figure 3—figure supplement 1. Avoidance pattern and its correlation with protein expression vary along mRNAs.
(A) A sliding window (length 21, step size 1) analysis based on a previously published GFP expression dataset (Kudla et al., 2009) shows the significance of the correlation between avoidance and the corresponding fluorescence values for each position along the coding region. Darker red regions show more significant positions (with higher −log₁₀(P) values). (B) This analysis indicates that the binding energy of the first 21 nt region influences protein expression more than any downstream region; the corresponding Spearman’s correlation coefficients for selected sliding window start positions are shown at the bottom right. It also supports our selection of the 5´ end coding region for avoidance.
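The sliding window scheme in panel A (length 21, step size 1) can be sketched as a small generator. Only the windowing is shown; scoring each window for binding energy would require an RNA interaction tool and is left out. The function name and the example sequence are hypothetical.

```python
def sliding_windows(seq, length=21, step=1):
    """Yield (start, window) pairs for every full-length window along seq."""
    for start in range(0, len(seq) - length + 1, step):
        yield start, seq[start:start + length]

# Hypothetical 30-nt coding fragment used only to illustrate the windowing.
fragment = "AUGAGCAAAGGCGAAGAACUGUUUACCGGC"
windows = list(sliding_windows(fragment))
```

A sequence of n nucleotides yields n − 20 windows with these defaults, so each window start position can be correlated with expression separately.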
Figure 3—figure supplement 2. Comparison of different regions for evolutionary conservation analyses.
(A) This box and whisker plot (similar to Figure 1C, but without archaea) shows −log₁₀(P) distributions for each bacterial phylum. The black dashed line indicates the significance threshold (p<0.05). We used 5´ end CDS regions as the designated interaction region. (B) In this plot, 5´ end UTR regions (90 nucleotides upstream to 21 nucleotides downstream) are used as the designated interaction regions. Both regions show similar avoidance conservation, which suggests that avoidance is not limited to the 5´ ends of the coding region.
Figure 3—figure supplement 3. The most avoided regions of selected T. thermophilus non-coding RNAs.
(A) A graphical view of an alignment of the T. thermophilus tRNAs (n = 46). Regions that have significantly (p<0.001, Mann-Whitney U test) fewer than expected interactions with T. thermophilus mRNAs are highlighted in red. These regions are therefore the most avoided regions by the host’s mRNAs. The grey blocks show gaps in the alignment. (B–D) A graphical view of the most avoided regions is illustrated for tmRNA, RNase P and SRP RNA respectively.
Figure 4. The median expression of core ncRNA genes (n = 325 data points) in prokaryotic genomes is nearly two orders of magnitude greater than that of core mRNAs (n = 8086 data points), indicating that ncRNAs constitute most of the cellular RNAs.
To create this plot, we used mean mapped reads per gene length (i.e. mean read depth per position) of each core gene. The expression data are compiled from 5 archaeal and 37 bacterial strains from a previous study (Lindgreen et al., 2014).
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.