Genome Biol. 2020 Aug 17;21(1):207.
doi: 10.1186/s13059-020-02091-3.

DiMSum: an error model and pipeline for analyzing deep mutational scanning data and diagnosing common experimental pathologies

Andre J Faure et al. Genome Biol.

Abstract

Deep mutational scanning (DMS) enables multiplexed measurement of the effects of thousands of variants of proteins, RNAs, and regulatory elements. Here, we present a customizable pipeline, DiMSum, that represents an end-to-end solution for obtaining variant fitness and error estimates from raw sequencing data. A key innovation of DiMSum is the use of an interpretable error model that captures the main sources of variability arising in DMS workflows, outperforming previous methods. DiMSum is available as an R/Bioconda package and provides summary reports to help researchers diagnose common DMS pathologies and take remedial steps in their analyses.

Keywords: Bioconda; Bioinformatic pipeline; Deep mutational scanning; R package; Statistical model; Variant effect prediction.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Schematic overview of a minimal DMS experiment and the DiMSum pipeline. a Schematic of a basic, plasmid-based microbial growth DMS experiment: (1) construction of a plasmid library of mutant variants and independent transformation or integration of plasmid library into host cells, (2) exposure of cell population to selective conditions, and (3) high-throughput sequencing of samples to obtain variant counts before and after selection, which are used to derive fitness estimates for each variant. Indicated are steps at which bottlenecks could arise, potentially restricting variant pool size or complexity (red roman numerals): [i] inefficient library construction (“library bottleneck”), [ii] inefficient plasmid transformations (“replicate bottleneck”), and [iii] inefficient DNA extraction (“DNA extraction bottleneck”). Unforeseen bottlenecks can lead to over-sequencing [iv] of variant pools and thus underestimation of the errors associated with fitness scores or even appearance of sequencing counts for variants not contained in the original variant pool. b DMS experiments typically have a hierarchical abundance structure, where variants with more mutations are orders of magnitude less abundant than the wild-type sequence or single mutants. c DiMSum flow chart. The WRAP module performs low-level processing of raw DNA sequencing reads to produce sample-wise variant counts. The STEAM module transforms the resulting counts to estimates of variant fitness and associated error. See Additional file 1: Fig. S1-6 for example report plots
Fig. 2
DiMSum error model estimates multiplicative and additive error sources in fitness scores. a Empirical variance of replicate fitness scores as a function of error estimates based on sequencing counts under Poisson assumptions in a deep mutational scan of TDP-43 (positions 290-331) [6]. Empirical variance (blue dots show average variance in equally spaced bins, error bars indicate avg. variance × (1 ± 2/ # variants per bin)) is over-dispersed compared to baseline expectation of variance being described by a Poisson distribution (black dashed line). The bimodality of the count-based error distribution results from the relatively low number of single nucleotide mutants which have high counts (thus low count-based error) and the many double nucleotide mutants which have low counts (thus higher count-based error). The DiMSum error model (red line) accurately captures the deviations of the empirical variance from Poisson expectation. Inset: bold cyan and magenta lines indicate multiplicative error term contributions to variance corresponding to input and output samples, respectively (dashed thin lines give input or output sample contributions to variance if multiplicative error terms were 1). The horizontal green line indicates the additive error term contribution. The red line indicates the full DiMSum error model. b The same as a but for a deep mutational scan of FOS [20] that shows more over-dispersion. c–f Multiplicative (c, e) and additive (in s.d. units, d, f) error terms estimated by the error model on the two datasets. Dots give mean parameters, error bars 90% confidence intervals
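The variance decomposition described in this caption can be illustrated with a minimal sketch. It assumes a fitness score that is a log ratio of output to input counts, so that the Poisson baseline variance is approximately 1/N_in + 1/N_out, and it assumes that the multiplicative terms scale the two count-based contributions while the additive term (in s.d. units) contributes its square. The function and parameter names are hypothetical and are not taken from the DiMSum package.

```python
def fitness_variance(n_in, n_out, m_in=1.0, m_out=1.0, a=0.0):
    """Illustrative variance of a log-ratio fitness score for one replicate.

    Assumptions (not confirmed by the caption beyond its qualitative description):
      - Poisson baseline: variance ~ 1/n_in + 1/n_out
      - m_in, m_out: multiplicative error terms (1.0 reproduces the Poisson baseline)
      - a: additive error term in s.d. units, contributing a**2 to the variance
    """
    if n_in <= 0 or n_out <= 0:
        raise ValueError("counts must be positive")
    return m_in / n_in + m_out / n_out + a ** 2


# Poisson baseline for a variant with 100 input and 100 output reads
baseline = fitness_variance(100, 100)

# Same variant with over-dispersion in the input sample and an additive floor
over_dispersed = fitness_variance(100, 100, m_in=2.0, m_out=1.0, a=0.1)
```

Under these assumptions, the additive term sets a variance floor that does not shrink with sequencing depth, which corresponds to the horizontal line in the inset.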
Fig. 3
DiMSum error model performance. Leave-one-out cross-validation to test error model performance. In turn, error models are trained on all but one replicate of a dataset, and z-scores of the differences in fitness scores between the training set f̄_train and the remaining test replicate f_test are calculated (i.e., fitness score differences normalized by the estimated error in the training set σ_train and test replicate σ_test; importantly, σ_test is estimated from error model parameters fit only on the training set replicates). Because fitness scores from replicate experiments should only differ by random chance, if the error models estimate the error magnitude correctly, z-scores should be normally distributed, and corresponding P values from a z test should be uniformly distributed. The tested error models are described in the “Results and discussion” and “Methods” sections. a, c Quantile-quantile plots of z-scores in TDP-43 290-331 library (a) and FOS library (c) compared to the expected normal distribution. b, d Quantile-quantile plots of P values from two-sided z test in TDP-43 290-331 library (b) and FOS library (d) compared to the expected uniform distribution. e Estimated error magnitude relative to the differences observed between replicate fitness scores in twelve DMS datasets in leave-one-out cross-validation (see the “Methods” section). Relative error magnitude = 1 means the estimated magnitude of errors fits the data. Relative error magnitude < 1 means the estimated errors are too small. Boxplots indicate median and 1st and 3rd quartiles (box), and whiskers extend to 1.5× interquartile range
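The z-score and P value calculation described in this caption can be sketched as follows. The sketch assumes that the training and test errors are independent and combine in quadrature, which the caption does not state explicitly; the function names are hypothetical.

```python
import math


def z_score(f_train, sigma_train, f_test, sigma_test):
    """Fitness difference normalized by the combined estimated error.

    Assumption: the two errors are independent, so variances add.
    """
    return (f_train - f_test) / math.sqrt(sigma_train ** 2 + sigma_test ** 2)


def two_sided_p(z):
    """Two-sided P value of a z test under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))


# One variant: mean training fitness 1.0 (error 0.3), held-out replicate 0.5 (error 0.4)
z = z_score(1.0, 0.3, 0.5, 0.4)
p = two_sided_p(z)
```

If the error estimates are well calibrated, z-scores computed this way over many variants should follow a standard normal distribution; systematically large absolute z-scores would indicate underestimated errors.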
Fig. 4
Effects of bottlenecks on variant count distributions and fitness scores. a Input sample count distributions of previously published DMS experiments [20, 50]. For FOS and FOS-JUN datasets, counts of single AA variants with one, two, or three nucleotide substitutions in the same codon are shown. For the tRNA dataset, all variants with one, two, or three nucleotide substitutions are shown. Wild-type counts are indicated by the black dashed line. Expected count frequencies purely due to sequencing errors are indicated by red and green dashed lines for single and double nucleotide substitution variants, respectively. Black arrows indicate sets of variants that have likely not been assayed but whose sequencing reads are arising due to sequencing errors. b Simulation of bottlenecks at various steps of the DMS workflow based on a previously published DMS dataset [6]. Scatterplots show input and output sample counts for variants with one or two nucleotide substitutions in the original data or after simulating 3% library, replicate, or DNA extraction bottlenecks (from left to right). Hexagon color indicates the number of nucleotide substitutions, and fill indicates the number of variants per 2D bin (see legend). Black arrows indicate sets of double nucleotide variants whose sequencing reads solely originate from sequencing errors. Dotted (or dashed) horizontal/vertical lines indicate soft (or hard) variant count thresholds used in downstream DiMSum analyses (see c). c Comparison of fitness scores from simulated datasets with (y-axis) or without (x-axis) the indicated bottlenecks. Variants are categorized by their robustness to filtering with hard (variants have to appear above the threshold in all replicates) or soft thresholds (variants have to appear above the threshold in at least one replicate) of 10 read counts. For the DNA extraction bottleneck, read count thresholds were also applied to output samples. Pearson correlation coefficients are indicated. The dashed line indicates the relationship y = x. Note that correlation coefficients are lower for soft than hard thresholds, because a subset of variants has fewer replicate measurements
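The distinction between hard and soft count thresholds in panel c can be illustrated with a small sketch. It assumes that "above the threshold" means a count greater than or equal to the threshold; the caption does not specify whether the comparison is inclusive, and the variant names and counts below are invented for illustration only.

```python
def passes_threshold(replicate_counts, threshold=10, mode="hard"):
    """Check a variant's per-replicate input counts against a read count threshold.

    mode="hard": the count must reach the threshold in all replicates.
    mode="soft": the count must reach the threshold in at least one replicate.
    Assumption: the comparison is inclusive (count >= threshold).
    """
    above = [count >= threshold for count in replicate_counts]
    if mode == "hard":
        return all(above)
    if mode == "soft":
        return any(above)
    raise ValueError("mode must be 'hard' or 'soft'")


# Hypothetical input counts for three variants across three replicates
counts = {
    "variant_A": [25, 40, 31],  # well covered in every replicate
    "variant_B": [12, 3, 0],    # lost in some replicates
    "variant_C": [2, 1, 4],     # never well covered
}

hard_pass = sorted(v for v, c in counts.items() if passes_threshold(c, mode="hard"))
soft_pass = sorted(v for v, c in counts.items() if passes_threshold(c, mode="soft"))
```

In this example, variant_B is retained only under the soft threshold, so its fitness would rest on fewer replicate measurements, in line with the caption's remark about lower correlations for soft thresholds.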

References

    1. Kinney JB, McCandlish DM. Massively parallel assays and quantitative sequence–function relationships. Annual Review of Genomics and Human Genetics. 2019. p. 99–127.
    2. Fowler DM, Fields S. Deep mutational scanning: a new style of protein science. Nat Methods. 2014;11:801–807.
    3. Domingo J, Baeza-Centurion P, Lehner B. The causes and consequences of genetic interactions (epistasis). Annu Rev Genomics Hum Genet. 2019;20:433–460.
    4. Fowler DM, Araya CL, Fleishman SJ, Kellogg EH, Stephany JJ, Baker D, et al. High-resolution mapping of protein sequence-function relationships. Nature Methods. 2010. p. 741–6.
    5. Olson CA, Wu NC, Sun R. A comprehensive biophysical description of pairwise epistasis throughout an entire protein domain. Curr Biol. 2014;24:2643–2651.
