Science. 2020 Jul 24;369(6502):eaav3751.
doi: 10.1126/science.aav3751.

Simultaneous cross-evaluation of heterogeneous E. coli datasets via mechanistic simulation

Derek N Macklin et al. Science. 2020.

Abstract

The extensive heterogeneity of biological data poses challenges to analysis and interpretation. Construction of a large-scale mechanistic model of Escherichia coli enabled us to integrate and cross-evaluate a massive, heterogeneous dataset based on measurements reported by various groups over decades. We identified inconsistencies with functional consequences across the data, including that the total output of the ribosomes and RNA polymerases described by the data is not sufficient for a cell to reproduce measured doubling times, that measured metabolic parameters are fully compatible neither with each other nor with overall growth, and that essential proteins are absent during parts of the cell cycle, an absence to which the cell is robust. Finally, considering these data as a whole leads to successful predictions of new experimental outcomes, in this case protein half-lives.

Figures

Fig. 1. A large-scale, integrated modeling approach to simultaneously cross-evaluate millions of heterogeneous data.
The data were collected from the primary literature and key databases, and in some cases were also generated as part of this study. Subsequent data curation and analysis led to the determination of 19,119 parameter values. We then incorporated these data into a large-scale computational model of E. coli gene expression, metabolism and growth, based on a foundation of > 10,000 interdependent mathematical equations that are then transformed into appropriate computational representations of biological processes. Color coding is used to connect terms in these equations to the data that produced their parameter values. This unified model was then used to produce fully integrated simulations, with output as shown at bottom. See Fig. S1, Movies S1 and S2, and the Supplement for more detail. Full details of the analysis required to generate this figure, as well as a pointer to the generating code, can be found in the Supplement, Section 1.2.
Fig. 2. Ribosomal and RNA polymerase output must be increased to support measured doubling times.
(A) Histogram comparing simulated doubling times (blue) to the experimentally determined doubling time for aerobic growth on glucose minimal media (orange line) with the model’s original parameter values taken directly from the literature. Median simulated doubling time is 125 minutes (dashed black line). (B) Sensitivity analysis outcome reported as the z-score (log-scale) of the difference in growth rate between all simulations where a given parameter was adjusted higher and all simulations where that parameter was adjusted lower. Horizontal dashed lines represent a z-score cutoff for a p-value below 0.05 that has been adjusted for multiple hypothesis testing of each of the parameters that were adjusted (93% of the total parameters, see Supplement for more details). Parameters are ordered by their impact on the simulated cells’ growth rate along the x-axis, with those having a significant z-score highlighted in orange and shown in more detail above and below the plot of all parameters. Parameters with the largest positive correlation with model growth are listed across the top, and parameters with the largest negative correlation are listed across the bottom. Parameter abbreviations are: translational efficiency (TE), RNA synthesis probability (SP) and protein degradation rate (PD). (C and D) Histograms comparing simulated doubling times (blue) to the experimentally determined doubling time for aerobic growth on glucose minimal media (orange line), with RNA polymerase, ribosome, and RNase expression calculated from the known doubling time as independent experiments (C), and with both RNA polymerase and ribosome expression calculated from the known doubling time (D). Median simulated doubling times are shown as dashed black lines. (E) RNA polymerase and ribosome abundances per cell as generated by the model in this study using the original (Fig. 2A) and new transcript synthesis probabilities (Fig. 2D), as compared to experimental data from (20) that were withheld from the model’s original parameterization. (F) Comparison of mRNA expression as measured by RNA-sequencing in this study (TPM, transcripts per million) and from a previous microarray study (51). (G) Violin plots showing distributions of RNA polymerase and ribosome cellular abundances from the simulations shown in Fig. 2D, compared with expected values determined experimentally (orange lines) (20). (H) Cellular properties calculated from the simulations for three different environmental conditions compared with their counterpart measurements reported in the literature (21). Error bars report standard deviations of each property calculated over the 1,024 cells that were simulated for each medium. Full details of the analysis required to generate this figure, as well as a pointer to the generating code, can be found in the Supplement, Section 1.2.
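The significance cutoff described for panel (B) can be illustrated with a minimal sketch. The caption does not say which multiple-testing adjustment was used, so the Bonferroni correction below is an assumption, and the function names `z_cutoff` and `significant` are hypothetical.

```python
from statistics import NormalDist


def z_cutoff(alpha: float, n_tests: int) -> float:
    """Two-sided z-score threshold for a Bonferroni-adjusted significance level.

    Assumption: Bonferroni is used here only for illustration; the caption
    does not name the adjustment method.
    """
    adjusted_alpha = alpha / n_tests
    return NormalDist().inv_cdf(1.0 - adjusted_alpha / 2.0)


def significant(z_scores: dict, alpha: float = 0.05) -> dict:
    """Return the parameters whose |z| exceeds the adjusted cutoff."""
    cutoff = z_cutoff(alpha, len(z_scores))
    return {name: z for name, z in z_scores.items() if abs(z) > cutoff}


if __name__ == "__main__":
    # Toy z-scores for two hypothetical parameters.
    print(significant({"param_a": 5.0, "param_b": 0.5}))
```

The cutoff grows with the number of parameters tested, which is why only a small fraction of adjusted parameters would be highlighted in a plot of this kind.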
Fig. 3. Evaluating metabolic parameter values against each other and in the context of cellular growth.
(A) Violin plot of concentrations at each simulation time point for downstream metabolites of the reaction catalyzed by CdsA – phosphatidylethanolamine (PE) and phosphatidylglycerol (PG) – when the concentration of CdsA is low (orange – original, short protein half-life) or high (blue – new, longer protein half-life, see main text). (B) Violin plot for glucose yield at each simulation time point for simulations with succinate dehydrogenase and fumarate reductase kinetics constraints disabled or enabled. Experimental value is 0.46 g cell / g glucose at μ=0.900 hr⁻¹ (52). (C) Comparison of the average fluxes from simulations with succinate dehydrogenase and fumarate reductase constraints disabled for a set of reactions in central carbon metabolism with experimental measurements (34). Orange points indicate outlier fluxes, which are discussed in more detail in the text. Correlation is shown for all data points (blue and orange) and when excluding outliers (blue). (D) Impact of individually disabling each kinetic reaction constraint on the succinate dehydrogenase flux in simulations, shown as a z-score representing the average change in flux for removing one constraint compared to the distribution of the average change in flux for removing each constraint. Constraints that have a z-score of <−0.1 are highlighted in orange and shown in more detail. Highlighted reaction constraints are part of the reactions that are further explored in E (abbreviations are listed below in F). (E) Comparison of average metrics for simulations from a two-level full factorial design to test the effects of removing up to eight kinetic constraints of interest. Inset shows the target region where the simulated glucose uptake rate is close to the expected glucose uptake rate and simulation succinate dehydrogenase flux is within a factor of 2 of the experimental flux (green region). Disabled constraint combinations are enumerated for each point in the target region. 
Orange points indicate simulations run with combinations of disabled constraints that included G, I, N and S; blue points indicate simulations run with at least one of these constraints enabled. (F) Distributions of predicted kcat value at each simulation time step (blue) and curated kinetic parameters (orange) for each reaction identified – citrate synthase (Ci), cytosine deaminase (Cy), phosphoserine aminotransaminase (P), glyoxylate reductase (Gx), isocitrate dehydrogenase (Ic), fumarate reductase (F), glutathione reductase (G), inorganic pyrophosphatase (I), NADH dehydrogenase (N), and succinate dehydrogenase (S). Original is from simulations without constraints for S and F; final is from simulations without constraints for Gx, Ic, G, I, N, and S. The black arrow for N indicates a newly curated kcat parameter that was not used in the model. (G and H) Similar to (B and C), but based on data from simulations with the new set of disabled constraints. (I) Representative output from simulations with the new set of disabled constraints, showing the increase in mass (normalized to initial mass and over a single life cycle) of six key cellular mass fractions. (J) Comparison between the metabolic fluxes calculated directly from the kinetic parameters (target) and the fluxes computed by simulations with the new set of disabled constraints, as summarized by the R² value. Gray points correspond to reactions with no simulated flux despite having a target flux. Correlations are shown for all data points (blue and gray) and with gray points excluded (blue only). Full details of the analysis required to generate this figure, as well as a pointer to the generating code, can be found in the Supplement, Section 1.2.
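The two-level full factorial design mentioned for panel (E) can be sketched as an enumeration of every enabled/disabled assignment over eight constraints. This is only an illustration: the caption does not list which eight constraints were varied, so the labels `c1` to `c8` and the `core` subset are placeholders, and `full_factorial` is a hypothetical helper.

```python
from itertools import product


def full_factorial(constraints):
    """Yield each two-level design point as the set of disabled constraints."""
    for levels in product((False, True), repeat=len(constraints)):
        yield frozenset(c for c, disabled in zip(constraints, levels) if disabled)


# Placeholder labels; the actual eight constraints are not named in the caption.
constraints = [f"c{i}" for i in range(1, 9)]
designs = list(full_factorial(constraints))
print(len(designs))  # 2**8 = 256 design points

# Design points in which four chosen constraints are all disabled
# (analogous to the orange points, which share a fixed subset).
core = {"c1", "c2", "c3", "c4"}
with_core = [d for d in designs if core <= d]
print(len(with_core))  # 2**4 = 16
```

Each design point would correspond to one simulation condition, so the cost of the design doubles with every additional constraint considered.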
Fig. 4. A large fraction of E. coli genes are transcribed less than once per cell cycle.
(A) A comparison of simulation and experimental results (32) with regard to the number of proteins expressed per cell for each gene. The proteins are grouped as being highly abundant if the measured count per cell is greater than or equal to 30, and otherwise low-abundant. The R-squared statistic is computed separately for each group on the log-transformed data. (B) Simulations of mRNA and protein expression over multiple generations for genes that are expressed at high (left, in red) and low levels (right, in blue; note that colors are conserved to preserve meaning throughout the figure) of transcriptional frequencies. Counts are shown for a representative six-generation long window, with an arbitrarily chosen zeroth starting generation. (C) Frequency of observing at least one gene transcript per generation over a 32-generation simulation. Histograms show that 1,547 genes are transcribed at least once per cell cycle (red), 203 genes are essentially never expressed in this environment (yellow), and the remaining 2,603 genes are transcribed with a frequency between zero and one (blue). (D) Expression frequency analysis of known essential genes. (E) Division of the sub-generationally transcribed genes into those for which at least one protein is present at all times during the simulations, and those for which the protein is absent for at least one time step (gray bars). Protein products of essential genes are indicated by the blue bars. Distinct protein units represent sub-generationally expressed monomers and protein complexes composed of sub-generationally expressed monomers. (F) Transcription, translation, complexation and metabolic activity of the PabAB heterodimer, which catalyzes a reaction responsible for producing folates. Each new generation is indicated with a tick mark along the x-axis; the gray area highlights a period of time in which the heterodimer is not present in the cell. 
All y-axes are linearly scaled except the [10⁻³, 0.44] region of the reaction flux plot, which is log-scaled for better readability. Full details of the analysis required to generate this figure, as well as a pointer to the generating code, can be found in the Supplement, Section 1.2.
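The grouping in panel (C) can be illustrated with a short sketch that computes, for each gene, the fraction of generations in which at least one transcript appears and then assigns one of the three classes. The gene names and counts are toy data, and the function names are hypothetical rather than taken from the study's code.

```python
def transcription_frequency(transcripts_per_generation):
    """Fraction of generations in which at least one transcript was produced."""
    n = len(transcripts_per_generation)
    return sum(1 for count in transcripts_per_generation if count > 0) / n


def classify(frequency):
    """Map a per-generation transcription frequency to one of three classes."""
    if frequency == 1.0:
        return "every generation"
    if frequency == 0.0:
        return "never"
    return "sub-generational"


# Toy data: transcripts produced in each of four generations.
toy_counts = {
    "geneA": [3, 5, 2, 4],
    "geneB": [0, 1, 0, 0],
    "geneC": [0, 0, 0, 0],
}

for gene, counts in toy_counts.items():
    freq = transcription_frequency(counts)
    print(gene, freq, classify(freq))
```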
Fig. 5. Integrated model-data comparison leads to improved prediction of protein half-lives.
(A) A comparison of calculated protein production rates against protein loss rates for each gene. Bold lines indicate areas where the production rate and loss rate differ by more than one order of magnitude. (B) Comparison of the N-end rule to new measurements of protein half-lives for the genes highlighted in (A). The three points highlighted in red were predicted to be outliers in the steady-state analysis because their corresponding protein half-lives were much shorter than the N-end rule’s prediction of 10 hours. Similarly, the proteins highlighted in blue were predicted to have much longer half-lives than the N-end rule’s prediction of 2 minutes. Solid bars indicate half-lives that were determined by intensities on a western blot and the striped bar indicates an estimate (assumed from higher N-end rule value) from intensity measurements using immunofluorescence. In all seven cases, these predictions were correct. The results of control experiments (testing our protein half-life measurements against previous reports) can be found in Fig. S5. (C) Images of E. coli MG1655 cells with either a His-tagged RpoH or CdsA plasmid that were induced for 1 hour using IPTG followed by the addition of tetracycline to inhibit translation. At the indicated timepoints, aliquots of the culture were harvested, and immunofluorescence was carried out using an anti-His antibody. His-RpoH protein signal decreased within minutes, while His-CdsA protein signal was maintained or increased over the timecourse. All images shown are scaled between 50–1000 AU. Scale bar (yellow) = 10 μm. A detailed look at the localization of RpoH and CdsA is shown in Fig. S5B. Replicates are shown in Figs. S5C and D. Full details of the analysis required to generate this figure, as well as a pointer to the generating code, can be found in the Supplement, Section 1.2.
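The steady-state comparison in panel (A) can be sketched as follows. This is an illustration under stated assumptions, not the study's method: loss is modeled as first-order degradation plus dilution by growth, the 45-minute doubling time, protein count and production rate are made-up values, and only the 2-minute and 10-hour half-lives come from the N-end rule values quoted in the caption.

```python
import math


def loss_rate(count, half_life_min, doubling_time_min):
    """Proteins lost per minute, assuming first-order degradation plus dilution."""
    k_deg = math.log(2) / half_life_min      # degradation rate constant
    k_dil = math.log(2) / doubling_time_min  # dilution by cell growth
    return count * (k_deg + k_dil)


def is_outlier(production, loss, max_log10_gap=1.0):
    """True when production and loss differ by more than one order of magnitude."""
    return abs(math.log10(production / loss)) > max_log10_gap


# Hypothetical protein: 100 copies per cell, produced at 2 proteins per minute.
production = 2.0
short = loss_rate(100, half_life_min=2, doubling_time_min=45)    # 2-minute half-life
long_ = loss_rate(100, half_life_min=600, doubling_time_min=45)  # 10-hour half-life

print(is_outlier(production, short))  # True: loss far exceeds production
print(is_outlier(production, long_))  # False: rates roughly balance
```

In this toy case the short half-life is inconsistent with the production rate at steady state, which is the kind of mismatch that would suggest the assumed half-life is too short.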

Comment in

  • Modeling the E. coli cell.
    Tang L. Nat Methods. 2020 Oct;17(10):958. doi: 10.1038/s41592-020-00974-8. PMID: 32994565.

References

    1. Stephens ZD, et al., PLoS biology 13, e1002195 (2015).
    2. Dolinski K, Troyanskaya OG, Molecular biology of the cell 26, 2575 (2015).
    3. O. S. Collaboration, et al., Science 349, aac4716 (2015).
    4. Begley CG, Ellis LM, Nature 483, 531 (2012).
    5. Domach M, Leung S, Cahn R, Cocks G, Shuler M, Biotechnology and bioengineering 26, 1140 (1984).
