Nat Commun. 2020 Apr 30;11(1):2113.
doi: 10.1038/s41467-020-15977-4.

Model-driven generation of artificial yeast promoters

Benjamin J Kotopka et al. Nat Commun.

Abstract

Promoters play a central role in controlling gene regulation; however, a small set of promoters is used for most genetic construct design in the yeast Saccharomyces cerevisiae. Generating and utilizing models that accurately predict protein expression from promoter sequences would enable rapid generation of useful promoters and facilitate synthetic biology efforts in this model organism. We measure the gene expression activity of over 675,000 sequences in a constitutive promoter library and over 327,000 sequences in an inducible promoter library. Training an ensemble of convolutional neural networks jointly on the two data sets enables very high (R² > 0.79) predictive accuracies on multiple sequence-activity prediction tasks. We describe model-guided design strategies that yield large, sequence-diverse sets of promoters exhibiting activities higher than those represented in training data and similar to current best-in-class sequences. Our results show the value of model-guided design as an approach for generating useful DNA parts.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. FACS-seq experimental strategy and data set overview.
a Schematic of tested libraries (above), indicating regions held constant in promoter design (gray boxes); schematic of two-color reporter device used to characterize promoter activity (below). RAP1, GCR1, ZEV: transcription factor binding sites; TATA: TATA box motif; TSS: transcription start site motif. b Schematic of FACS-seq approach for high-throughput promoter activity characterization, in which next-generation sequencing (NGS)-derived histograms of sequence counts in FACS bins generated by sorting a library on promoter activity are used to derive promoter activity (log10 ratio of GFP to mCherry intensity, in arbitrary units) for each sequence in a library. Solid line: point estimate of promoter activity for an example sequence (blue points and histogram bins). Color gradient qualitatively indicates GFP:mCherry ratio for each cell or bin. c Histogram of promoter activities (log10 ratio of GFP to mCherry intensity, in arbitrary units) in the final PGPD library. Only sequences for which at least ten NextSeq reads were counted in each replicate were used in this analysis. Color gradient qualitatively indicates GFP:mCherry ratio for each sequence. d Density scatter plot of induced and uninduced promoter activities measured in the final PZEV library. Only sequences for which at least 20 NextSeq reads were counted in each replicate were used in this analysis. Density: density of plotted points (arbitrary units).
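The idea in panel b, turning a per-sequence histogram of read counts across FACS bins into one activity value, can be illustrated with a minimal sketch. The caption does not say how the point estimate is computed, so the read-weighted mean below is an assumption chosen for simplicity, not the authors' estimator. The function names, the bin centres, and the per-replicate count layout are all hypothetical.

```python
def passes_read_filter(replicate_counts, min_reads=10):
    """Return True if every replicate has at least `min_reads` reads in total.

    `replicate_counts` is a list with one list of per-bin read counts
    per replicate (a hypothetical layout used for illustration).
    """
    return all(sum(counts) >= min_reads for counts in replicate_counts)


def point_estimate(bin_counts, bin_log_ratios):
    """Read-weighted mean of bin centres (log10 GFP:mCherry, arbitrary units).

    Assumption: a simple weighted mean stands in for whatever estimator
    the study actually used.
    """
    total = sum(bin_counts)
    if total == 0:
        raise ValueError("sequence has no reads in any bin")
    return sum(c * x for c, x in zip(bin_counts, bin_log_ratios)) / total


# Hypothetical bin centres and read counts for one sequence, two replicates.
bin_centres = [-1.0, -0.5, 0.0, 0.5]
replicates = [[0, 5, 10, 5], [3, 3, 3, 0]]

if passes_read_filter(replicates, min_reads=9):
    print(point_estimate(replicates[0], bin_centres))
```

The `min_reads` default of 10 mirrors the threshold quoted for the PGPD library; the PZEV analyses in the caption use 20.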
Fig. 2. A neural network ensemble trained on PGPD and PZEV data accurately predicts promoter activity.
Only sequences for which at least ten NextSeq reads were counted in each replicate were used in analyses of PGPD data; only sequences for which at least 20 NextSeq reads were counted in each replicate were used in analyses of PZEV data. Density: density of plotted points (arbitrary units). a Predicted promoter activities versus FACS-seq measurements for PGPD sequences in the held-out test data. b Predicted promoter activities in the uninduced condition versus FACS-seq measurements for PZEV sequences in the held-out test data. c Predicted promoter activities in the induced condition versus FACS-seq measurements for PZEV sequences in the held-out test data. d Predicted activation ratios (ratio of predicted induced and uninduced promoter activities) versus FACS-seq-derived activation ratios for PZEV sequences in the held-out test data.
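The two quantities compared in this figure, prediction accuracy and activation ratio, can be sketched as follows. This record does not state whether the reported R² is a coefficient of determination or a squared correlation, so the first function assumes the former. The second assumes activities are stored as log10 values, as the Fig. 1 caption describes, so a ratio on the linear scale becomes a difference of logs. Both function names are hypothetical.

```python
def r_squared(measured, predicted):
    """Coefficient of determination of predictions against measurements."""
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


def activation_ratio(log10_induced, log10_uninduced):
    """Linear-scale induced:uninduced ratio from two log10 activities."""
    return 10 ** (log10_induced - log10_uninduced)


# Hypothetical held-out measurements and model predictions (log10 units).
measured = [1.0, 2.0, 3.0]
predicted = [1.1, 1.9, 3.2]
print(r_squared(measured, predicted))
print(activation_ratio(1.0, -1.0))  # two log10 units apart
```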
Fig. 3. Performance of designed promoter sets in validation FACS-seq experiment.
In panels (a–c), boxes represent interquartile ranges; the bar within each box indicates the median. Whiskers extend to the furthest observation within 1.5 interquartile ranges of the nearest box edge. Numbers over boxplots indicate the number of sequences measured in FACS-seq in each promoter set. Promoter activities are shown here on a linear scale, and were transformed to a scale commensurable with the results of individual promoter testing using a linear model fit to promoter activities measured by FACS-seq and by individual testing for a set of promoters spanning a range of expression activities. a FACS-seq measurements of promoter activities for PGPD promoter sets (or corresponding training data sequences). Training data: selected highly active sequences from the initial PGPD FACS-seq; Screening: PGPD promoter set generated using the screening approach; Evolution: PGPD promoter set generated using the evolution approach; Evolution-GC: PGPD promoter set generated using the evolution approach, with the GC constraint applied; Gradient: PGPD promoter set generated using the gradient ascent approach; Gradient-GC: PGPD promoter set generated using the gradient ascent approach, with the GC constraint applied. Points placed along the horizontal line were only measured in the highest-activity bin in FACS-seq. b FACS-seq measurements of promoter activities for PZEV promoter sets designed to maximize induced activity (or corresponding training data sequences). Axis labels referring to PZEV-Induced sequences and designs, but otherwise as in (a); Gradient*: PZEV-Induced promoter set generated using the gradient approach, with an elevated target threshold set relative to other designs. Points placed along the horizontal line were only measured in the highest-activity bin in FACS-seq. c FACS-seq measurements of promoter activities for PZEV promoter sets designed to maximize activation ratio (or corresponding training data sequences). Axis labels referring to PZEV-Activation Ratio sequences and designs, but otherwise as in (b). Source data are available in the Source Data file.
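The boxplot convention described at the start of this caption (box at the interquartile range, whiskers at the furthest observation within 1.5 interquartile ranges of the box edge) can be written out as a short sketch. The quartile interpolation method is not specified in the caption; the "inclusive" method of Python's `statistics.quantiles` is an assumption, and the function name and data are hypothetical.

```python
import statistics


def box_stats(values):
    """Median, quartiles, and whisker ends for one boxplot.

    Whiskers stop at the furthest data point lying within 1.5 IQR
    of the nearest box edge; points beyond that are outliers.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    whisker_low = min(v for v in values if v >= q1 - 1.5 * iqr)
    whisker_high = max(v for v in values if v <= q3 + 1.5 * iqr)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": whisker_low,
        "whisker_high": whisker_high,
    }


# Hypothetical promoter activities with one extreme value.
activities = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(box_stats(activities))
```

In this example the value 100 lies beyond the upper fence, so the upper whisker stops at 9 rather than at the maximum.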
Fig. 4. Validating activities of individual designed promoters by flow cytometry characterization.
a Promoter activity measurements (as base-10 logarithms) for selected sequences measured both in FACS-seq and by individual flow cytometry. FACS-seq: promoter activities determined from FACS-seq; Individual testing: promoter activities as determined by flow cytometry. Data are presented as mean values ± s.e.m. (n = 3 biologically independent samples). b Individually measured promoter activities (linear scale) determined by flow cytometry for selected PGPD designs and for control sequences (Control). Evolution-GC: randomly chosen sequences from the selected PGPD promoter set designed using the evolution strategy and the GC constraint. For clarity, a subset of those measured covering the measured range of activities is shown here; results for the entire set appear in Supplementary Table 4. Gradient*: randomly chosen sequences from the promoter set designed using the gradient strategy, with an elevated selection threshold. Control sequence names are indicated by text labels. *p = 1.43 × 10⁻³, two-sided t test. c Individually measured promoter activities (linear scale) determined by flow cytometry for a selected PZEV-Induced design and for control sequences (Control). Evolution-GC: randomly chosen sequences from the selected PZEV-Induced promoter set designed using the evolution strategy and the GC constraint. For clarity, a subset of those measured covering the measured range of activities is shown here; results for the entire set appear in Supplementary Table 4. d Individually measured activation ratios (linear scale) determined by flow cytometry for a selected PZEV-Activation Ratio design and for control sequences (Control). Evolution-GC: randomly chosen sequences from the selected PZEV-Activation Ratio promoter set designed using the evolution strategy and the GC constraint. e Individually measured promoter activities (linear scale) in the uninduced condition for sequences displayed in (d) and for a Background control (pCS4306) expressing mCherry, but not GFP. In panels (b–e), promoter names, from left to right, are as in Supplementary Tables 4 and 5; bars and error bars in these panels represent the mean and s.e.m. (n = 3 biologically independent samples) of the original log-scale measurements, converted to linear scale. Source data are available in the Source Data file.
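The last sentence describes bars and error bars computed on log-scale measurements and then converted to a linear scale. One plausible reading is sketched below: take the mean and s.e.m. of the log10 values, then exponentiate the mean and the mean ± s.e.m. bounds. The caption does not spell out the conversion, so this is an assumption, and the function name and replicate values are hypothetical.

```python
import math
import statistics


def log_summary_to_linear(log10_values):
    """Return (centre, lower, upper) on the linear scale.

    The mean and s.e.m. are computed on the log10 measurements, and the
    three values 10**mean, 10**(mean - sem), 10**(mean + sem) are returned.
    The resulting error bars are asymmetric around the bar height.
    """
    mean = statistics.fmean(log10_values)
    sem = statistics.stdev(log10_values) / math.sqrt(len(log10_values))
    return 10 ** mean, 10 ** (mean - sem), 10 ** (mean + sem)


# Hypothetical log10 activities from n = 3 replicates.
centre, lower, upper = log_summary_to_linear([1.0, 2.0, 3.0])
print(centre, lower, upper)
```

Under this reading the bar height is the geometric mean of the linear-scale replicates, which is why the lower and upper bounds sit at unequal distances from it.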
Fig. 5. In silico mutagenesis enables identification of functional motifs in designed sequences.
In panels (d, e), boxes represent interquartile ranges; the bar within each box indicates the median. Whiskers extend to the furthest observation within 1.5 interquartile ranges of the nearest box edge. a In silico mutagenesis reveals patterns of predicted position-wise importance in designed sequences. Above: Normalized position-wise scores (averages of scores for each sequence in the design set) for a PGPD design set. Middle: Normalized position-wise scores for a PZEV-Induced design set. Below: Normalized position-wise scores for a PZEV-Activation Ratio design set. Gray areas: conserved regions, held constant. b Position-dependent features identified as sequence logos in PGPD designs. Above: Schematic of the PGPD construct (as in Fig. 1a). Highlighted regions: areas shown in detail below. Sequence logos, top left: Sequence logo for sequence context 5′ to first GCR1 site. Top right: Sequence logo for sequence context 5′ to second GCR1 site. Bottom left: Sequence logo for sequence context 3′ to TATA motif. Bottom right: Sequence logo for sequence context 5′ to ATG start codon. c Putative transcription factor binding sites identified in PGPD designs. Transcription factors with overlapping binding specificities were pooled as described in the text. Transcription Factor: factor or factors with a Yeastract motif matching the identified sequence. d Median score differentials of TA repeats in the 5′ spacers of PZEV-Induced sequences, by TA repeat length. Non-TA: score differentials for bases outside any TA repeat in the 5′ spacers of tested sequences. 12+: three sequences of length 12, one of length 14, and one of length 16. Numbers under boxplots indicate the number of sequences in each category. e Median score differentials of 4-bp sequences following each of the three ZEV ATF sites (Site 1, Site 2, Site 3) in PZEV-Activation Ratio sequences. TRUE/FALSE: Sequence after ZEV ATF site is/is not GCTA. Numbers under boxplots indicate the number of sequences in each category. Source data are available in the Source Data file.
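In silico mutagenesis, as used in panel a, generally means substituting each base in turn and recording how much the model's prediction changes. A minimal sketch follows, with loud assumptions: `toy_predict` is a hypothetical stand-in for the trained ensemble, the score is defined as the mean drop in prediction over the three alternative bases, and normalization divides by the largest absolute score. None of these definitions is taken from this record.

```python
BASES = "ACGT"


def toy_predict(seq):
    """Hypothetical stand-in model: counts occurrences of 'TATA'."""
    return float(sum(seq[i:i + 4] == "TATA" for i in range(len(seq) - 3)))


def position_scores(seq, predict):
    """Mean drop in predicted activity when each position is mutated."""
    baseline = predict(seq)
    scores = []
    for i, original in enumerate(seq):
        drops = [
            baseline - predict(seq[:i] + base + seq[i + 1:])
            for base in BASES
            if base != original
        ]
        scores.append(sum(drops) / len(drops))
    return scores


def normalize(scores):
    """Scale scores so the largest absolute value is 1 (assumed convention)."""
    peak = max(abs(s) for s in scores)
    return [s / peak if peak else 0.0 for s in scores]


# Positions inside the TATA motif score high; flanking bases score zero.
print(normalize(position_scores("GGTATAGG", toy_predict)))
```

Averaging these per-sequence scores position by position across a design set would give curves of the kind the caption describes, with conserved regions excluded from mutation.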
