Nat Commun. 2020 May 11;11(1):2338.
doi: 10.1038/s41467-020-15851-3.

Deep Learning Enables Accurate Clustering With Batch Effect Removal in Single-Cell RNA-seq Analysis

Free PMC article

Xiangjie Li et al. Nat Commun. 2020.

Abstract

Single-cell RNA sequencing (scRNA-seq) can characterize cell types and states through unsupervised clustering, but the ever-increasing number of cells and batch effects impose computational challenges. We present DESC, an unsupervised deep embedding algorithm that clusters scRNA-seq data by iteratively optimizing a clustering objective function. Through iterative self-learning, DESC gradually removes batch effects, as long as technical differences across batches are smaller than true biological variations. Because DESC is a soft clustering algorithm, its cluster assignment probabilities are biologically interpretable and can reveal both discrete and pseudotemporal structure of cells. Comprehensive evaluations show that DESC offers a proper balance of clustering accuracy and stability, has a small memory footprint, does not explicitly require batch information for batch effect removal, and can utilize a GPU when available. As the scale of single-cell studies continues to grow, we believe DESC will offer a valuable tool for biomedical researchers to disentangle complex cellular heterogeneity.
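
To make the notion of soft cluster assignment concrete, the following is a minimal sketch of one common way to compute assignment probabilities and a self-training objective. It assumes a DEC-style formulation (a Student's t kernel on distances to cluster centroids, squared-and-renormalized targets, and a KL divergence loss); the abstract does not spell out the objective, so the function names, the `alpha` parameter, and the toy embedding are illustrative assumptions rather than the DESC implementation.

```python
import math

def soft_assign(z, centroids, alpha=1.0):
    """Soft assignment of one embedded cell z to each centroid (Student's t kernel)."""
    weights = []
    for mu in centroids:
        d2 = sum((a - b) ** 2 for a, b in zip(z, mu))  # squared Euclidean distance
        weights.append((1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0))
    total = sum(weights)
    return [w / total for w in weights]

def target_distribution(q):
    """Sharpened targets: p_ij proportional to q_ij^2 / f_j, with f_j = sum_i q_ij."""
    n_clusters = len(q[0])
    f = [sum(row[j] for row in q) for j in range(n_clusters)]
    p = []
    for row in q:
        w = [row[j] ** 2 / f[j] for j in range(n_clusters)]
        s = sum(w)
        p.append([x / s for x in w])
    return p

def kl_loss(p, q):
    """Clustering objective: sum over cells of KL(p_i || q_i)."""
    return sum(
        pij * math.log(pij / qij)
        for prow, qrow in zip(p, q)
        for pij, qij in zip(prow, qrow)
        if pij > 0
    )

# Toy 2-D embedding (hypothetical values): two cells near one centroid, one near the other.
embedding = [[0.0, 0.1], [0.2, 0.0], [3.0, 3.1]]
centroids = [[0.0, 0.0], [3.0, 3.0]]
q = [soft_assign(z, centroids) for z in embedding]
p = target_distribution(q)
hard_labels = [row.index(max(row)) for row in q]
print(hard_labels, kl_loss(p, q))
```

In an iterative scheme of this kind, the encoder and centroids would be updated to reduce `kl_loss`, and `q` and `p` would then be recomputed; each row of `q` is the per-cell probability vector that a soft clustering method reports.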

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. The workflow of DESC.
a Overview of the DESC framework. DESC starts with parameter initialization in which a stacked autoencoder is used for pretraining and learning a low-dimensional representation of the input gene expression matrix. The resulting encoder is then added to the iterative clustering neural network to cluster cells iteratively. The final output of DESC includes cluster assignment, the corresponding probabilities for cluster assignment for each cell, and the low-dimensional representation of the data; b–d The t-SNE plots of DESC for the macaque retina scRNA-seq data generated by Peng et al. The plots are colored by macaque id (b), sample id (c), and region (d). e The ARIs of different methods. The ARIs were calculated when taking different information (macaque id, sample id, region id) as batch in analysis, and “All” was calculated when no batch information was provided in analysis.
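
For readers unfamiliar with the ARI (adjusted Rand index) reported in panel e, the sketch below computes it from two label vectors using the standard pair-counting formula. The function name and the toy labels are illustrative; the caption does not say which implementation the authors used.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare
    # Pairs of cells that share a cluster in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs that share a cluster in each labeling separately.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_cells - expected) / (max_index - expected)

# Cluster names are arbitrary: a relabeled but identical partition scores 1.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

An ARI of 1 means the clustering matches the reference labels exactly, values near 0 mean agreement no better than chance, and negative values mean agreement worse than chance.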
Fig. 2. Comparison of the robustness of different methods for batch definition based on the macaque retina scRNA-seq data.
a The KL divergences calculated for macaque id (left plot), region id (middle plot), and sample id (right plot) when taking macaque id, region id, or sample id as the batch definition in analysis for each method. The box represents the interquartile range, the horizontal line in the box is the median, and the whiskers extend to 1.5 times the interquartile range. The word “All” in the legend for DESC and scVI indicates that they take the whole dataset as input without considering any batch information in analysis. This figure shows that DESC yields robust results for batch effect removal no matter what batch information was provided in analysis, whereas the other methods are sensitive to the choice of batch definition. b The t-SNE plots showing region distribution for different methods when region was treated as batch in analysis.
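
The caption does not define how the KL divergence is computed. As a hedged illustration, the sketch below assumes one plausible definition: the divergence between the batch composition of a local group of cells (for example, the nearest neighbors of a cell in the embedding) and the batch composition of the whole dataset. The function name and toy batch labels are hypothetical.

```python
import math
from collections import Counter

def batch_kl(local_batches, all_batches):
    """KL divergence of local batch proportions from global batch proportions.

    Assumes every batch label in local_batches also occurs in all_batches.
    """
    local_counts = Counter(local_batches)
    global_counts = Counter(all_batches)
    n_local, n_all = len(local_batches), len(all_batches)
    kl = 0.0
    for batch, count in local_counts.items():
        p = count / n_local               # local proportion of this batch
        q = global_counts[batch] / n_all  # global proportion of this batch
        kl += p * math.log(p / q)
    return kl

all_batches = ["A"] * 50 + ["B"] * 50
well_mixed = ["A"] * 5 + ["B"] * 5  # neighborhood mirrors the global composition
single_batch = ["A"] * 10           # neighborhood drawn from one batch only
print(batch_kl(well_mixed, all_batches), batch_kl(single_batch, all_batches))
```

Under this definition, a value near 0 indicates that batches are well mixed locally, and larger values indicate residual batch separation.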
Fig. 3. Comparison between DESC and scVI when batch information was not provided in the analysis of macaque retina data.
a The t-SNE results of DESC when no batch information was provided; cells are colored by cell type, macaque id, region id, and sample id, respectively. b The t-SNE results of scVI when no batch information was provided; cells are colored by cell type, macaque id, region id, and sample id, respectively. The cells are mixed well by macaque id, region id, and sample id in DESC, but are completely separated by macaque id, region id, and sample id in scVI. These results indicate that DESC is able to remove complex batch effects without explicit use of batch information.
Fig. 4. Clustering results for the pancreatic islet data generated from different scRNA-seq protocols.
a The t-SNE plots in which cells were colored by batch. b The ARI values of different methods. c The KL divergence of different methods. d The t-SNE plots showing that DESC removes batch effect gradually over iterations. e The KL divergence over iterations. The box in c and e represents the interquartile range, the horizontal line in the box is the median, and the whiskers extend to 1.5 times the interquartile range.
Fig. 5. The results of PBMC data generated by Kang et al.
a, b DESC clustering without taking batch information in the analysis. a is colored by batch id and b is colored by cell type. c Volcano plots of differential expression analysis between control and stimulated conditions for each cell type. Highlighted are differentially expressed genes identified using the Wilcoxon rank sum test with fold change > e^0.25 and FDR-adjusted p value < 10^−50. CD14+ monocytes have the largest number of differentially expressed genes compared with other cell types. d The KL divergence calculated using all cells and using non-CD14+ monocytes only. The box represents the interquartile range, the horizontal line in the box is the median, and the whiskers extend to 1.5 times the interquartile range.
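
The thresholds in panel c can be read as a two-part filter: a fold change above e^0.25 (equivalently, a natural-log fold change above 0.25) and an FDR-adjusted p value below 10^−50. The sketch below applies such a filter to precomputed test results. It assumes a Benjamini-Hochberg adjustment and a two-sided threshold on the absolute log fold change, neither of which is stated in the caption; the gene names and values are made up.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg FDR-adjusted p values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def select_de_genes(genes, log_fold_changes, pvalues, min_log_fc=0.25, max_fdr=1e-50):
    """Keep genes with |ln fold change| > min_log_fc and adjusted p < max_fdr.

    The absolute value (two-sided filter) is an assumption of this sketch.
    """
    fdr = benjamini_hochberg(pvalues)
    return [
        g for g, lfc, adj in zip(genes, log_fold_changes, fdr)
        if abs(lfc) > min_log_fc and adj < max_fdr
    ]

# Hypothetical per-gene results (natural-log fold changes and raw p values).
genes = ["GENE_A", "GENE_B", "GENE_C"]
log_fold_changes = [1.2, 0.1, -0.8]
pvalues = [1e-80, 1e-90, 0.2]
print(select_de_genes(genes, log_fold_changes, pvalues))
```

Here only the first gene passes both thresholds: the second has too small a fold change and the third is not significant after adjustment.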
Fig. 6. The results of mouse bone marrow data generated by Paul et al.
a The t-SNE plot showing the maximum probabilities of cluster assignments of cells. The maximum probability is the probability for the cluster that is assigned with the highest probability by DESC. b The t-SNE plots of clustering results by DESC and scVI. Compared with scVI, DESC yields more accurate clustering results for DC, Lymph, and Mk cells. In scVI, the clustering result is more diffuse, and Mk cells are mixed together with GMP cells. In contrast, DESC clearly separates DC, Lymph, and Mk cells from the other cell clusters. c, d The t-SNE plots of true cell-type labels (obtained from the original publication) for DESC and scVI. e, f The t-SNE plots of true cell-type labels with pseudotime ordering (obtained from the original publication) for DESC and scVI. Ery erythrocyte, MEP megakaryocyte/erythrocyte progenitors, Mk megakaryocyte, GMP granulocyte/macrophage progenitors, DC dendritic cell, Baso basophils, Mo monocyte, Neu neutrophils, Eos eosinophils, Lymph lymphocyte.
Fig. 7. The estimated pseudotime plots for the human monocyte data.
Shown are the results of Monocle3-estimated pseudotime using: a low-dimensional representation from DESC as input; b CCA components from Seurat 3.0 as input; c CCA components from method CCA as input; d PCA components of corrected gene expression values from MNN as input; e low-dimensional representation from scVI as input; f low-dimensional representation from BERMUDA as input; g low-dimensional representation from scanorama as input; h raw gene count matrix as input. Default parameters in Monocle3 were used to conduct dimension reduction and pseudotime estimation. i KL divergences that measure the degree of batch effect removal for different methods. “Raw” represents the output of Monocle3 using the raw gene count matrix as the input. The box represents the interquartile range, the horizontal line in the box is the median, and the whiskers extend to 1.5 times the interquartile range.
Fig. 8. The expression of marker gene S100A8 (classical monocytes) over pseudotime for different methods for cells across batches.
The black line is the smoothed expression curve when cells from all batches are included. The red, green, and blue lines are the smoothed expression curves for cells from T1, T2, and T3, respectively. Pseudotime from all methods was scaled to [0, 1] for comparison. a Low-dimensional representation from DESC as input; b CCA components from Seurat 3.0 as input; c CCA components from method CCA as input; d PCA components of corrected gene expression values from MNN as input; e low-dimensional representation from scVI as input; f low-dimensional representation from BERMUDA as input; g low-dimensional representation from scanorama as input; h raw gene count matrix as input.
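
The caption states that pseudotime was scaled to [0, 1] but not how. A minimal sketch, assuming ordinary min-max scaling applied separately to each method's pseudotime values (the function name and inputs are illustrative):

```python
def scale_to_unit_interval(values):
    """Min-max scale a list of pseudotime values to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All cells share one pseudotime; map them to 0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Methods may report pseudotime on different scales; scaling makes the curves comparable.
print(scale_to_unit_interval([2.0, 4.0, 6.0]))
```

Scaling of this kind preserves the ordering of cells within a method, so it changes only the horizontal axis of the expression curves, not their shape.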
Fig. 9. Comparison of memory usage (first column) and running time (second and third columns).
a–c The number of batches for the analyzed samples is 2; the analyzed data were from Kang et al. d–f The number of batches for the analyzed samples is 4. g–i The number of batches for the analyzed samples is 30. For d–i, the analyzed data were from Peng et al., in which there are four batches when taking macaque id as the batch definition and 30 batches when taking sample id as the batch definition. Because DESC, scVI, and BERMUDA are deep learning based methods, we put them together for ease of comparison. Remark: the running time of DESC is smaller for batch = 30 than for batch = 4 because, when the data were standardized by sample id (i.e., when batch = 30), the algorithm converged quickly, before reaching the maximum number of epochs (300). The “Error” in the bar plots in g and i indicates that there was an error when using Seurat 3.0, because the numbers of cells in some batches are very small. The asterisk above the bar plots in d, f, g, and i indicates that the corresponding method failed due to a memory issue (i.e., it could not allocate memory); the recorded time is therefore the computing time until the method failed. When the number of batches is 30, BERMUDA always throws an error when the number of cells is less than 8000, so we only report BERMUDA when the number of cells is ≥ 10,000. In addition, the reported running time and memory usage include only the clustering procedure and do not include the computation of t-SNE or UMAP. All reported times and memory usage in this figure were measured on our workstation running Ubuntu 18.04.1 LTS with an Intel® Core(TM) i7-8700K CPU @ 3.70 GHz and 64 GB of memory.
Fig. 10. Comparison with different methods for batch effect removal.
The second column was computed based on the results shown in Fig. 1e, and the error bar is the standard error of the ARI when using different information as batch in analysis. The fifth column was computed based on the results shown in Fig. 9a. For each method, memory usage is shown for 1000, 2000, 5000, 8000, 10,000, and 20,000 cells, respectively.

