Nat Methods. 2019 Apr;16(4):311-314.
doi: 10.1038/s41592-019-0353-7. Epub 2019 Mar 18.

Scalable analysis of cell-type composition from single-cell transcriptomics using deep recurrent learning

Yue Deng et al. Nat Methods. 2019 Apr.

Abstract

Recent advances in large-scale single-cell RNA-seq enable fine-grained characterization of phenotypically distinct cellular states in heterogeneous tissues. We present scScope, a scalable deep-learning-based approach that can accurately and rapidly identify cell-type composition from millions of noisy single-cell gene-expression profiles.

Conflict of interest statement

Competing Financial Interests Statement

The authors declare no competing interests.

Figures

Fig. 1. Overview of scScope architecture and performance on simulated datasets.
a) Overview of the recurrent network architecture of scScope. An input single-cell profile with dropout gene measurements (white entries) is corrected for batch effects, then the corrected vector x is sequentially processed by an encoder layer (for feature extraction), a decoder layer (for noise reduction) and an imputation layer (for dropout imputation). The imputed vector v is added back to the batch-corrected input profile x to fill in missing values. This process is repeated recursively T times to produce a final signature feature vector h, which is used for biological discovery, such as identification of phenotypically distinct subpopulations.
b) Comparison of run time on datasets of different scales. Datasets of varying size were randomly subsampled from a dataset containing 1.3 million mouse brain cells and used for comparison (Methods).
c) Clustering accuracy for 2K-cell scRNA-seq data with varying sparsity. Splatter was used to generate 2K cells in 3 subpopulations with varying dropout rates (Supplementary Table 3). Accuracy is measured by the adjusted Rand index (Methods). For each simulated condition, n = 10 random replicates were simulated. Box plot: median (center line), interquartile range (box) and minimum-maximum range (whiskers).
d) Clustering accuracy for 1M-cell scRNA-seq data with varying fraction of rare subpopulations. The simulation strategy of SIMLR was used to generate 1M cells. Dropout rate = 0.5; total number of clusters = 50; number of rare subpopulations = 5; replicate number n = 10. For MAGIC, ZINB-WaVE and SIMLR, the 1M datasets were randomly downsampled to 10K cells, and PhenoGraph was used for de novo cell subpopulation discovery. For methods run on the full 1M dataset, a scalable clustering approach was used to identify subpopulations (Methods). Box plot as in 1c.
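The loop described in panel a of Fig. 1 can be illustrated with a small sketch. This is not the authors' implementation: the weights are random and untrained, and the ReLU activations, layer sizes, treatment of exact zeros as dropouts, and the names `recurrent_impute` and `random_matrix` are all assumptions made for illustration. Batch correction is assumed to have been applied to x already.

```python
import random


def relu(vec):
    """Element-wise rectifier (an assumed activation, not stated in the caption)."""
    return [max(0.0, a) for a in vec]


def matvec(matrix, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * a for w, a in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.5):
    """Hypothetical helper: untrained weights drawn uniformly from [-scale, scale]."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def recurrent_impute(x, w_enc, w_dec, w_imp, steps=2):
    """Run encoder -> decoder -> imputation `steps` times (T in the caption).

    x is the batch-corrected profile; zero entries are treated as dropouts.
    Returns the last feature vector h and the profile with dropouts filled in.
    """
    missing = [a == 0.0 for a in x]
    current = list(x)
    h = []
    for _ in range(steps):
        h = relu(matvec(w_enc, current))  # encoder: feature extraction
        y = relu(matvec(w_dec, h))        # decoder: noise reduction
        v = relu(matvec(w_imp, y))        # imputation layer
        # Add v back to x only where the measurement was missing.
        current = [xi + vi if m else xi for xi, vi, m in zip(x, v, missing)]
    return h, current


rng = random.Random(0)
n_genes, n_latent = 6, 3
w_enc = random_matrix(n_latent, n_genes, rng)
w_dec = random_matrix(n_genes, n_latent, rng)
w_imp = random_matrix(n_genes, n_genes, rng)

x = [1.0, 0.0, 2.0, 0.0, 0.5, 3.0]
h, filled = recurrent_impute(x, w_enc, w_dec, w_imp, steps=2)
print(len(h), filled)
```

Observed entries pass through unchanged, and only the zero entries receive imputed values. In a trained model the three weight matrices would be learned, which this sketch does not attempt.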
Fig. 2. Evaluation of methods on experimental scRNA-seq datasets.
a) Analysis of batch correction. Comparison of (top) batch mixing entropy and (bottom) computational run time, without or with batch correction, using a mouse lung tissue scRNA-seq dataset. Top box plot: median (center line), interquartile range (box) and minimum-maximum range (whiskers); n = 100 replicates of 100 randomly selected cells across all batches. Bottom: run time to process the whole dataset.
b) Analysis of imputation accuracy for different gene expression levels. Comparison of imputation error for dropout genes at different gene expression levels (octiles) using the cord blood mononuclear cell (CBMC) scRNA-seq dataset.
c) Analysis of subpopulation identification for increasing gene depth. Using the mouse cell atlas, we compared the ability of different approaches to identify the 51 known tissues in the atlas. Black color: the provided software package was unable to complete the task.
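As a rough illustration of the quantity in panel a of Fig. 2, the sketch below computes the Shannon entropy of batch labels within a group of cells. The exact definition of batch mixing entropy, including how cells and their neighborhoods are selected, is given in the paper's Methods and is not reproduced here; the function name `batch_entropy` and the example labels are assumptions.

```python
import math
from collections import Counter


def batch_entropy(batch_labels):
    """Shannon entropy (natural log) of the batch labels in one group of cells.

    Higher values mean the batches are more evenly mixed in the group;
    0 means every cell in the group comes from a single batch.
    """
    n = len(batch_labels)
    counts = Counter(batch_labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


well_mixed = ["batch1", "batch2", "batch1", "batch2"]
unmixed = ["batch1", "batch1", "batch1", "batch1"]
print(batch_entropy(well_mixed), batch_entropy(unmixed))
```

Under this reading, successful batch correction should raise the entropy, because cells from different batches end up near each other in the learned feature space.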
Fig. 3. Application of scScope to explore biology in the 1.3M-cell mouse brain dataset.
a) Fractions of three major cell types (glutamatergic neurons, GABAergic neurons and non-neurons) identified by scScope, compared with neuron fractions reported by previous SPLiT-seq research.
b) Left: scScope results visualized using tSNE on n = 30K cells (randomly sampled from the full dataset). Clusters were divided into three major types based on gene markers. Right: Large-scale annotation of clusters to known cell types according to the top 10 overexpressed genes. Violin plots: expression distribution of marker genes for discovered clusters. Vertical axis (left): clusters with known cell type annotations and corresponding cell numbers. Horizontal axis: differentially expressed marker genes across the clusters shown. Vertical axis (right): cluster annotation based on previously reported cell-subtype-specific genes.
