Nat Commun. 2020 Nov 30;11(1):6077.
doi: 10.1038/s41467-020-19894-4.

muscat detects subpopulation-specific state transitions from multi-sample multi-condition single-cell transcriptomics data

Helena L Crowell et al. Nat Commun. 2020.

Abstract

Single-cell RNA sequencing (scRNA-seq) has become an empowering technology to profile the transcriptomes of individual cells on a large scale. Early analyses of differential expression have aimed at identifying differences between subpopulations to identify subpopulation markers. More generally, such methods compare expression levels across sets of cells, thus leading to cross-condition analyses. Given the emergence of replicated multi-condition scRNA-seq datasets, an area of increasing focus is making sample-level inferences, termed here as differential state analysis; however, it is not clear which statistical framework best handles this situation. Here, we surveyed methods to perform cross-condition differential state analyses, including cell-level mixed models and methods based on aggregated pseudobulk data. To evaluate method performance, we developed a flexible simulation that mimics multi-sample scRNA-seq data. We analyzed scRNA-seq data from mouse cortex cells to uncover subpopulation-specific responses to lipopolysaccharide treatment, and provide robust tools for multi-condition analysis within the muscat R package.
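As a rough illustration of the aggregation step behind pseudobulk methods, the sketch below sums raw counts over all cells that share a subpopulation and a sample. It is not the muscat implementation: the function name `aggregate_pseudobulk` and the toy data are assumptions made for illustration, and summing is only one possible way to aggregate.

```python
from collections import defaultdict


def aggregate_pseudobulk(counts, clusters, samples):
    """Sum per-cell counts into one profile per (subpopulation, sample) pair.

    counts   -- list of per-cell count vectors (one list of gene counts per cell)
    clusters -- subpopulation label of each cell
    samples  -- sample label of each cell
    """
    if not (len(counts) == len(clusters) == len(samples)):
        raise ValueError("counts, clusters and samples need one entry per cell")
    n_genes = len(counts[0]) if counts else 0
    pseudobulk = defaultdict(lambda: [0] * n_genes)
    for cell, cluster, sample in zip(counts, clusters, samples):
        profile = pseudobulk[(cluster, sample)]
        for g, value in enumerate(cell):
            profile[g] += value
    return dict(pseudobulk)


# Toy data (hypothetical): 4 cells x 3 genes, 2 subpopulations, 2 samples.
counts = [[1, 0, 3], [2, 1, 0], [0, 5, 1], [4, 0, 2]]
clusters = ["neuron", "neuron", "astro", "neuron"]
samples = ["s1", "s1", "s1", "s2"]
pb = aggregate_pseudobulk(counts, clusters, samples)
```

The result holds one count vector per subpopulation-sample combination, which is the unit that sample-level tests operate on.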

Conflict of interest statement

L.C., C.R., and D.M. are full-time employees of Roche. H.L.C., C.S., P.L.G., D.C., and M.D.R. declare no competing interests.

Figures

Fig. 1. Schematic overview of muscat’s simulation framework.
a Given a count matrix of features by cells and, for each cell, predetermined cluster (subpopulation) identifiers as well as sample labels (0), dispersion and sample-wise means are estimated from a negative binomial distribution for each gene (for each subpopulation) (1.1); and library sizes are recorded (1.2). From this set of parameters (dispersions, means, and library sizes), gene expression is sampled from a negative binomial distribution. Here, genes are selected to be “type” (subpopulation-specifically expressed; e.g., via marker genes), “state” (change in expression in a condition-specific manner), or equally expressed (relatively) across all samples (2). The result is a matrix of synthetic gene expression data (3). b Differential distributions are simulated from an NB distribution or mixtures thereof, according to the definitions of random variables X, Y, and Z. c t-SNE plots for a set of simulation scenarios with varying percentage of “type” genes (top), DS genes (middle), and the difference in the magnitude (logFC) of DS between subpopulations (bottom). d Schematic overview of cell- and sample-level approaches for DS analysis. Top panels show a schematic of the data distributions or aggregates across samples (each violin is a group or sample; each dot is a sample) and conditions (blue or orange). The bottom panels highlight the data organization into sub-matrix slices of the original count table.
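To make the sampling step in panel a concrete, here is a minimal sketch of drawing counts from a negative binomial distribution through its gamma-Poisson mixture form. It assumes the parameterization variance = mean + dispersion × mean², which the caption does not specify. The function names, the parameter values, and the simple Poisson sampler (Knuth's method, adequate only for small means) are illustrative choices and not part of muscat. Library sizes, which the framework also records, are left out for brevity.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) value with Knuth's method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_nb(mean, dispersion, rng):
    """Draw one negative binomial count with the given mean and dispersion.

    Assumed parameterization: variance = mean + dispersion * mean**2.
    """
    shape = 1.0 / dispersion
    scale = mean * dispersion
    rate = rng.gammavariate(shape, scale)  # gamma-distributed Poisson rate
    return sample_poisson(rate, rng)


# Hypothetical gene with mean 5 and dispersion 0.5 (expected variance 17.5).
rng = random.Random(0)
draws = [sample_nb(mean=5.0, dispersion=0.5, rng=rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
sample_var = sum((x - sample_mean) ** 2 for x in draws) / (len(draws) - 1)
```

With these settings the sample variance should clearly exceed the sample mean, which is the overdispersion that separates the negative binomial from a plain Poisson model.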
Fig. 2. DS method performance across p value adjustment types, differential distribution categories, and subpopulation-sample cell counts.
All panels show observed overall true positive rate (TPR) and false discovery rate (FDR) values at FDR cutoffs of 1%, 5%, and 10%; dashed lines indicate desired FDRs (i.e., methods that control FDR at their desired level should be left of the corresponding dashed lines). For each panel, performances were averaged across five simulation replicates, each containing 10% of DS genes (of the type specified in the panel labels of (a), and 10% of DE genes for (b); see Fig. 1b for further details). a Comparison of locally and globally adjusted p values, stratified by DS type. Performances were calculated from subpopulation-level (locally) adjusted p values (top row) and cross-subpopulation (globally) adjusted p values (bottom row), respectively. b Performance of detecting DS changes according to the number of cells per subpopulation-sample, stratified by the method.
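The two quantities plotted in these panels can be computed from a set of detections and the simulated ground truth, as sketched below. The function name and the toy gene sets are hypothetical; the definitions are the standard ones, TPR = TP / (TP + FN) and FDR = FP / (TP + FP).

```python
def tpr_fdr(called, truth):
    """Return (TPR, FDR) for called genes given the truly differential set."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # true positives
    fp = len(called - truth)   # false positives
    fn = len(truth - called)   # false negatives
    tpr = tp / (tp + fn) if truth else float("nan")
    fdr = fp / (tp + fp) if called else 0.0  # no calls means no false discoveries
    return tpr, fdr


# Toy example (hypothetical): 4 truly differential genes, 3 calls.
truth = {"g1", "g2", "g3", "g4"}
called = {"g1", "g2", "g5"}
tpr, fdr = tpr_fdr(called, truth)
```

Here two of the four true genes are recovered and one of the three calls is false, so the observed FDR of one third would sit far to the right of a dashed line at 5%.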
Fig. 3. Between-method concordance.
Upset plot obtained from intersecting the top-n ranked gene-subpopulation combinations (lowest p value) across methods and simulation replicates. Here, n = min(n1, n2), where n1 = number of genes simulated to be differential, and n2 = number of genes called differential at FDR < 0.05. Shown are the 40 most frequent intersections; coloring corresponds to (true) simulated gene categories. The bottom right annotation indicates method types (PB pseudobulk (aggregation-based) methods, MM mixed models, and AD Anderson–Darling tests).
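The choice of n and the subsequent intersection can be sketched as follows for two methods. Everything here is illustrative: the function name, the dictionaries of adjusted p values, and the toy values are assumptions. The sketch also ranks by the same adjusted p values it uses for calling, and the caption does not state how ties are broken.

```python
def top_n_hits(adj_pvals, n_simulated, fdr_cutoff=0.05):
    """Return the top-n gene-subpopulation keys ranked by p value.

    n is the smaller of n1 (number simulated to be differential) and
    n2 (number called at adjusted p value < fdr_cutoff).
    """
    n_called = sum(p < fdr_cutoff for p in adj_pvals.values())
    n = min(n_simulated, n_called)
    ranked = sorted(adj_pvals, key=adj_pvals.get)
    return ranked[:n]


# Hypothetical adjusted p values from two methods for four genes.
method_a = {"g1": 0.001, "g2": 0.01, "g3": 0.04, "g4": 0.20}
method_b = {"g1": 0.002, "g2": 0.30, "g3": 0.03, "g4": 0.01}
hits_a = top_n_hits(method_a, n_simulated=2)
hits_b = top_n_hits(method_b, n_simulated=2)
shared = set(hits_a) & set(hits_b)
```

Capping n at the number of simulated differential genes keeps a method that calls many genes from dominating the intersections.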
Fig. 4. DS analysis of cortex tissue from vehicle- and LPS-treated mice.
a Shared color and shape legend of subpopulation and group IDs. b UMAP visualization colored by subpopulation (left) and group ID (right). c Pseudobulk-level Multidimensional Scaling (MDS) plot. Each point represents one subpopulation-sample instance; points are colored by subpopulation and shaped by group ID. d Heatmap of pseudobulk-level log-expression values normalized to the mean of vehicle samples; rows correspond to genes, columns to subpopulation-sample combinations. Included is the union of DS detections (FDR < 0.05, |logFC| > 1) across all subpopulations. Data are split horizontally by subpopulation and vertically by consensus clustering ID (of genes); top and bottom 1% logFC quantiles were truncated for visualization. Bottom-row violin plots represent cell-level effect coefficients computed across all differential genes, and scaled to a maximum absolute value of 1 (each violin is a sample; coloring corresponds to group ID); effect coefficients summarize the extent to which each cell reflects the population-level fold-changes (see “Methods”).
Fig. 5. Summary of DS method performance across a set of evaluation criteria.
Methods are ranked from left to right by their weighted average score across criteria, with the numerical encoding good = 2, intermediate = 1, and poor/NA = 0. Evaluation criteria (y-axis) comprise DS detection sensitivity (TPR) and specificity (FDR) for each type of differential distribution, uniformity of p value distributions under the null (null simulation), concordance between simulated and estimated logFCs (logFC estimation), ability to accommodate complex experimental designs (complex design), and runtimes (speed). Top annotation indicates method types (PB pseudobulk (aggregation-based) methods, MM mixed models, AD Anderson–Darling tests). Null simulation, logFC estimation, complex design, and runtimes received equal weights of 0.5; TPR and FDR were weighted according to the frequencies of modalities in scRNA-seq data reported by Korthauer et al.: ~75% unimodal, ~5% trimodal, and ~25% bimodal, giving weights of 0.75 for DE, 0.125 for DP and DM, and 0.05 for DB.
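The ranking score described in this caption amounts to a weighted average of the encoded ratings. The sketch below uses the weights given in the caption; the ratings for the example method are made up. Two readings of the caption are assumptions: that DP and DM each receive 0.125, and that the same per-category weight applies to both TPR and FDR.

```python
RATING = {"good": 2, "intermediate": 1, "poor": 0, "NA": 0}

# Weights as stated in the caption (see assumptions in the lead-in).
CATEGORY_WEIGHTS = {"DE": 0.75, "DP": 0.125, "DM": 0.125, "DB": 0.05}
OTHER_WEIGHTS = {
    "null simulation": 0.5,
    "logFC estimation": 0.5,
    "complex design": 0.5,
    "speed": 0.5,
}


def build_weights():
    """Combine the general criteria with TPR/FDR weights per DS category."""
    weights = dict(OTHER_WEIGHTS)
    for metric in ("TPR", "FDR"):
        for category, w in CATEGORY_WEIGHTS.items():
            weights[f"{metric} {category}"] = w
    return weights


def weighted_score(ratings, weights):
    """Weighted average of encoded ratings; the result lies between 0 and 2."""
    total = sum(weights.values())
    return sum(w * RATING[ratings[c]] for c, w in weights.items()) / total


weights = build_weights()
# Hypothetical method: rated good on every criterion except speed.
ratings = {criterion: "good" for criterion in weights}
ratings["speed"] = "poor"
score = weighted_score(ratings, weights)
```

Under these assumptions the weights sum to 4.1, so a poor rating on speed alone lowers the score from 2 to 7.2 / 4.1, or about 1.76.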
