Data Reduction for Spectral Clustering to Analyze High Throughput Flow Cytometry Data

Habil Zare et al. BMC Bioinformatics, 11, 403.

Abstract

Background: Recent biological discoveries have shown that clustering large datasets is essential for better understanding biology in many areas. Spectral clustering in particular has proven to be a powerful tool suited to many applications. However, it cannot be applied directly to large datasets because of time and memory limitations. To address this issue, we have modified spectral clustering by adding an information-preserving sampling procedure and a post-processing stage. We call the entire algorithm SamSPECTRAL.

Results: We tested our algorithm on flow cytometry data as an example of large, multidimensional data containing potentially hundreds of thousands of data points (i.e., "events" in flow cytometry, typically corresponding to cells). Compared to two state-of-the-art model-based flow cytometry clustering methods, SamSPECTRAL demonstrates significant advantages in properly identifying populations with non-elliptical shapes, low density populations close to dense ones, minor subpopulations of a major population, and rare populations.

Conclusions: This work is the first successful attempt to apply spectral methodology to flow cytometry data. An implementation of our algorithm as an R package is freely available through Bioconductor.

Figures

Figure 1
Data reduction scheme. (a) Running spectral clustering is impractical on data that contains thousands of points. (b) Faithful sampling picks a subset of points small enough that spectral clustering can be run on it. However, all information about the local density is lost if only these sample points are considered. (c) We assign weights to the edges of the graph; the edges between nodes in denser regions are weighted considerably higher. In this way, the information about the local density is recovered.
Figure 2
Faithful sampling. (a) Original data from telomere data set before sampling. (b) The distribution of representatives is almost uniform in the space after faithful sampling.
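The caption above says only that the representatives end up almost uniformly spread in space. One way to obtain that behaviour is sketched below in Python, as an illustration rather than the package's implementation: points are visited in random order, and each point not yet covered becomes a representative that absorbs every unassigned point within a radius `h`. The function name `faithful_sampling`, the radius parameter `h`, the Euclidean distance, and the toy data are all assumptions made for this sketch.

```python
import numpy as np

def faithful_sampling(points, h, seed=0):
    """Pick representatives so that every point lies within distance h of one.

    Returns (rep_idx, community): rep_idx holds the row indices of the chosen
    representatives, and community[i] is the position in rep_idx of the
    representative that point i was assigned to.
    """
    rng = np.random.default_rng(seed)
    n = len(points)
    community = np.full(n, -1)           # -1 means "not yet assigned"
    rep_idx = []
    for i in rng.permutation(n):         # visit points in random order
        if community[i] != -1:
            continue                     # already covered by an earlier representative
        # All still-unassigned points within distance h (including i itself)
        # join the community of the new representative.
        dist = np.linalg.norm(points - points[i], axis=1)
        members = (dist <= h) & (community == -1)
        community[members] = len(rep_idx)
        rep_idx.append(i)
    return np.array(rep_idx), community

# Toy data (hypothetical): one dense cluster plus sparse uniform background.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 0.3, (2000, 2)),
                  rng.uniform(-3.0, 3.0, (200, 2))])
reps, community = faithful_sampling(data, h=0.5)
community_sizes = np.bincount(community)  # large in dense regions, small elsewhere
```

In this sketch the representatives are at least `h` apart, so their positions no longer reflect density; the density information survives only in `community_sizes`, which is what the edge weights described in Figure 1(c) would need to draw on.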
Figure 3
Defining the similarity between two communities and identifying the number of clusters. (a) We define the similarity between two communities c and c' as the sum of pairwise similarities between the members of c and the members of c'. (b) This figure shows the largest eigenvalues of a sample from the stem cell dataset. The number of clusters is estimated from the knee point of the eigenvalue curve. This point is defined as the intersection of the regression line shown in the plot and the line y = 1. The horizontal coordinate of the knee point estimates the number of spectral clusters.
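Both definitions in this caption are simple enough to sketch in Python. The caption does not say which pairwise similarity is used or which eigenvalues the regression line is fitted to, so the Gaussian kernel, the bandwidth `sigma`, and the rule "fit only eigenvalues below `1 - tol`" are assumptions of this sketch, as are all function names.

```python
import numpy as np

def pairwise_similarity(x, y, sigma=1.0):
    # Assumed Gaussian kernel on Euclidean distance; the caption does not
    # specify the pairwise similarity function.
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def community_similarity(c, c_prime, sigma=1.0):
    # Sum of pairwise similarities between members of c and members of c'.
    return float(pairwise_similarity(c, c_prime, sigma).sum())

def estimate_num_clusters(eigenvalues, tol=0.01):
    # Sort eigenvalues in decreasing order and index them 1, 2, 3, ...
    vals = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    idx = np.arange(1, len(vals) + 1)
    # Assumption: the regression line is fitted to the eigenvalues that are
    # clearly below 1 (at least two are needed, and they must not be flat).
    tail = vals < 1.0 - tol
    slope, intercept = np.polyfit(idx[tail], vals[tail], deg=1)
    # Knee point: where the fitted line crosses y = 1.
    return (1.0 - intercept) / slope

# Toy usage with made-up values.
c = np.array([[0.0, 0.0], [0.1, 0.0]])
c_prime = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]])
s = community_similarity(c, c_prime)
k = estimate_num_clusters([1.0, 1.0, 1.0, 0.9, 0.8, 0.7, 0.6])
```

For the toy eigenvalues above, the fitted line crosses y = 1 close to 3, matching the three eigenvalues that sit at 1. In practice the knee coordinate would be rounded to an integer before being used as the number of clusters.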
Figure 4
Comparative clustering of the telomere dataset. (a-c) Proper identification of overlapping populations. Although the two populations shown by red and blue contours overlap in all bivariate plots of this 3-dimensional sample, SamSPECTRAL can properly distinguish them by considering multiple parameters simultaneously. (d) SamSPECTRAL can also identify two major subpopulations of granulocytes correctly, as verified by expert analysis. (e) flowMerge does not distinguish between the two populations of interest, and (f) FLAME improperly splits the same sample into several clusters.
Figure 5
Comparative clustering of dead cells (PI positive) and live cells (PI negative) in the viability data. (a) SamSPECTRAL could distinguish between dead cells (blue) and live cells (red) properly. (b) flowMerge identified dead cells correctly, but split live cells into two clusters. (c) FLAME did not distinguish between these two populations.
Figure 6
Comparative clustering of the GvHD dataset. (Left) Identification of non-elliptical shaped populations. (a) SamSPECTRAL could properly identify the red, non-elliptical population, while (b) flowMerge mixed this population with the one below it. (c) FLAME produced satisfactory results in identifying this population. (Right) Identification of low density populations close to dense populations. (d) SamSPECTRAL and (e) flowMerge could identify the low density population shown in red at the centre of the figure correctly, while (f) FLAME merged this population with the other ones surrounding it.
Figure 7
Comparative identification of a low density population surrounded by much denser populations in the stem cell data set. (a-c) SamSPECTRAL correctly identified the blue, low density population, while (d-f) flowMerge merged it with the yellow, high density population. (g-i) FLAME merged it with the red population. (j-l) The outcome of our modified MCL was similar to that obtained by SamSPECTRAL using classic spectral clustering. This shows that SamSPECTRAL is extensible by substituting classic spectral clustering with other clustering algorithms for weighted graphs.
Figure 8
Rare population in the stem cell data set. (a-c) This is a typical sample from the stem cell data set that contains a rare population. In these three-dimensional plots, the red dots represent the cells that are positive for all three markers. Only 23/9721 (0.24%) events belong to this population in this sample. SamSPECTRAL could properly identify the rare population in 27/34 (79.4%) samples from the stem cell data set.
Figure 9
Performance of SamSPECTRAL on synthetic data. (a) This synthetic two-dimensional data consists of a normal distribution with 30,000 points, four normal distributions each with 300 points, and a uniform background noise with 4000 points. (b) Around 3000 sample points are picked up by faithful sampling. These are distributed almost uniformly in the space; therefore, almost all information about density will be lost if one considers only the sample points. (c) The final outcome of SamSPECTRAL confirms that the information about density could be retrieved by properly assigning weights to the edges of the graph. The high density cluster is shown in red and the surrounding sparser clusters are shown in yellow, light blue, green and black.
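A data set with the composition described in panel (a) can be generated along the following lines in Python. Only the point counts (30,000, four times 300, and 4000) come from the caption; the cluster centres, standard deviations, extent of the background noise, and random seed are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# One dense normal population with 30,000 points (centre and spread assumed).
dense = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(30_000, 2))

# Four sparser normal populations with 300 points each (centres assumed).
centres = [(-4.0, -4.0), (-4.0, 4.0), (4.0, -4.0), (4.0, 4.0)]
small = [rng.normal(loc=c, scale=0.5, size=(300, 2)) for c in centres]

# Uniform background noise with 4000 points (bounding box assumed).
noise = rng.uniform(low=-8.0, high=8.0, size=(4_000, 2))

data = np.vstack([dense, *small, noise])

# Ground-truth labels: 0 = dense cluster, 1-4 = small clusters, 5 = noise.
labels = np.repeat([0, 1, 2, 3, 4, 5], [30_000, 300, 300, 300, 300, 4_000])
```

Keeping `labels` alongside `data` makes it possible to check whether a clustering recovers the small populations despite the 100-fold difference in size.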
Figure 10
Comparing uniform sampling with faithful sampling. Directly applying classical spectral clustering is not efficient on this sample of the stem cell dataset, which contains 48,000 cytometry events in 3 dimensions. (a) Although only 2115 data points were selected by faithful sampling, each population has a considerable number of representatives among the selected points. (b) 3000 points were selected by uniform sampling. The low density population in the middle of the plot is represented by only 55 sample points, which causes it to be incorrectly mixed with a high density population (d). (c) The result of SamSPECTRAL on the original data is satisfactory because the low density red population and the other high density populations are identified properly.
