Skip to main page content
Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2018 Mar 9;19(1):93.
doi: 10.1186/s12859-018-2092-7.

An Interpretable Framework for Clustering Single-Cell RNA-Seq Datasets

Affiliations
Free PMC article

An Interpretable Framework for Clustering Single-Cell RNA-Seq Datasets

Jesse M Zhang et al. BMC Bioinformatics. .
Free PMC article

Abstract

Background: With the recent proliferation of single-cell RNA-Seq experiments, several methods have been developed for unsupervised analysis of the resulting datasets. These methods often rely on unintuitive hyperparameters and do not explicitly address the subjectivity associated with clustering.

Results: In this work, we present DendroSplit, an interpretable framework for analyzing single-cell RNA-Seq datasets that addresses both the clustering interpretability and clustering subjectivity issues. DendroSplit offers a novel perspective on the single-cell RNA-Seq clustering problem motivated by the definition of "cell type", allowing us to cluster using feature selection to uncover multiple levels of biologically meaningful populations in the data. We analyze several landmark single-cell datasets, demonstrating both the method's efficacy and computational efficiency.

Conclusion: DendroSplit offers a clustering framework that is comparable to existing methods in terms of accuracy and speed but is novel in its emphasis on interpretabilty. We provide the full DendroSplit software package at https://github.com/jessemzhang/dendrosplit .

Keywords: Clustering; Feature selection; Interpretability; Single-cell RNA-seq.

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

JMZ is a past employee of BD Genomics. JF, HCF, and DR are past and current employees of BD Genomics.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
Fig. 1
Overview of DendroSplit. a The workflow starts by preprocessing a N×M matrix of gene expressions before computing cell-cell pairwise distances, resulting in a N×N distance matrix. The distance matrix is fed into a hierarchical clustering algorithm to generate a dendrogram. The dynamic splitting step involves recursively splitting the tree into smaller subtrees corresponding to potential clusters. Finally, the subtrees are merged together during a cleanup step to produce final clusters. b A split corresponds to the partitioning of a larger cluster into two smaller clusters. A split is only deemed valid if the separation score, a metric for how well-separated two populations are, is above a predefined split threshold. Leveraging biological intuition, we rank how well each gene distinguishes the two subpopulations based on independent Welch’s t-tests. We use the – log of the smallest p-value obtained as our separation score due to its interpretability and practical effectiveness. A split threshold of 10 would work for the example shown here. c During the merge step, the clusters obtained from the split step are compared to one another using pairwise separation scores. If the closest two clusters are not sufficiently far apart based on a predefined merge threshold, they are merged together and the process is repeated. When all clusters are sufficiently far apart, the algorithm terminates and the final labels are output
Fig. 2
Fig. 2
Synthetic Datasets. a The DendroSplit approach is applied to a synthetic 2-dimensional dataset where pairwise distances are equal to the Euclidean distances between points. The dendrogram splitting process can be visualized using a tree, and each box in the tree represents a step in the algorithm where a larger cluster is partitioned into the red and green clusters. For each of the two features (dimensions), the split is evaluated based on the distributions of that feature within the candidate clusters. Teal points are “background” points and not considered for a given step. b DendroSplit is evaluated on two other 2-dimensional synthetic datasets and recovers the correct number of clusters both times. Euclidean distance is used. c We note that DendroSplit cannot overcome poor preprocessing and distance metric selection. Directly computing Euclidean distances for the points in the concentric circle dataset would yield poor performance, but using Euclidean distance after some preprocessing (e.g. mapping each point to its distance from the center) yields the correct results. For the examples shown here, the merge thresholds are 10, and the split thresholds are 40 for (a) and 30 for (b, c)
Fig. 3
Fig. 3
Gold standard datasets. DendroSplit is evaluated on four single-cell RNA-Seq datasets where the labels are highly likely to be correct [2, 5, 8, 9, 34]. In addition to visual inspection, cluster quality is evaluated using the adjusted Rand index (ARI) based on the true labels. We observe here that the split step tends to generate more clusters than expected, shrinking the ARI. Additionally, due to how the dendrogram is constructed, a cell may end up in its own cluster and is consequentially labeled as a “Singleton”. The merge step treats both these cases. The cells are visualized using either the first two principal components (PC) or the first two t-distributed stochastic neighbor embedding [63] (t-SNE) components. The reported runtimes include computation of the pairwise distance matrices
Fig. 4
Fig. 4
Exploratory analysis on Patel et al. dataset. a DendroSplit is evaluated on Patel et al.’s dataset of 430 cells, 5948 features (genes) from five primary human glioblastomas [7]. Gene expression is quantified using TPM. The split and merge thresholds are 20 and 15, respectively, and the analysis takes 9.64 seconds to run. The numbers in the legends represent the number of points in the corresponding clusters. For the split step, the names of the clusters are generated based on the position of the subtrees in the dendrogram. “r” represents the root node, and “rRL” represents the subtree found at the left child of the right child of the root. b We can evaluate how cells were partitioned at each step of the split procedure, and DendroSplit can also show us the within-cluster distributions of the gene that validates the split. c We can also evaluate how clusters obtained after the split step were combined during the merge procedure, and DendroSplit can show us the distributions of the most distinguishing gene between two merged clusters
Fig. 5
Fig. 5
Larger datasets. DendroSplit is evaluated on three large single-cell datasets where the labels were assigned using computational methods [12, 17, 20]. Because M independent Welch’s t-tests are performed at each potential split, the runtime of the algorithm scales linearly with M. As demonstrated with the Zeisel et al. dataset, decreasing M by a factor of 16.6 likewise decreases the runtime by the same factor. The datasets here are preprocessed by filtering out genes using the procedure described by Macosko et al. [20]. This preprocessing step also improves the quality of the distance metric used, resulting in better performance
Fig. 6
Fig. 6
Exploratory analysis on PBMC dataset. a After gene selection and removal of cells with less than 50 counts across all genes, DendroSplit generates clusters for Zheng et al.’s remaining dataset of 17426 cells, 908 features (genes) from fresh peripheral blood mononuclear cells (PBMCs) [21]. Gene expression is quantified using UMI counts. The split and merge thresholds are 200 and 100, respectively, and the analysis takes 119.97 seconds to run. b 5 of the 15 recorded valid splits are shown along with the expression levels of the top 4 genes used for validating each split. The reported runtimes include computation of the pairwise distance matrices
Fig. 7
Fig. 7
Split threshold sweeping. Since DendroSplit saves all relevant information at each valid split (e.g. p-values from Welch’s t-tests and IDs of cells being compared), we can run the method with a small split threshold to gather information about several potential splits. From this information, we can generate the labels we would have obtained after the split step had we run the algorithm with a larger split threshold. a For the 704×13473 Kolodziejczyk et al. [5] dataset, running DendroSplit using a split threshold of 2 takes 37.63 seconds, and generating a set of new labels takes 0.032 seconds. b For the 3005×1202 Zeisel et al. dataset, running DendroSplit using a split threshold of 2 takes 13.48 seconds, and generating new labels takes 0.403 seconds. For both datasets, the DendroSplit runtimes include computation of the pairwise distance matrices

Similar articles

See all similar articles

Cited by 5 articles

References

    1. Yuan GC, et al.Challenges and emerging directions in single-cell analysis. Genome Biol. 2017; 18:84. 10.1186/s13059-017-1218-y. - PMC - PubMed
    1. Biase FH, Cao X, Zhong S. Cell fate inclination within 2-cell and 4-cell mouse embryos revealed by single-cell rna sequencing. Genome Res. 2014;24:1787–96. doi: 10.1101/gr.177725.114. - DOI - PMC - PubMed
    1. Trapnell C, et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol. 2014;32:381–6. doi: 10.1038/nbt.2859. - DOI - PMC - PubMed
    1. Goolam M, et al.Heterogeneity in oct4 and sox2 targets biases cell fate in 4-cell mouse embryos. Cell. 2016; 165:61–74. http://www.sciencedirect.com/science/article/pii/S0092867416300617. - PMC - PubMed
    1. Kolodziejczyk AA, et al.Single cell rna-sequencing of pluripotent states unlocks modular transcriptional variation. Cell Stem Cell. 2015; 17:471–85. http://www.sciencedirect.com/science/article/pii/S193459091500418X. - PMC - PubMed

Publication types

LinkOut - more resources

Feedback