Cytometry A. 2018 Jun;93(6):597-610.
doi: 10.1002/cyto.a.23371. Epub 2018 Apr 17.

DAFi: A Directed Recursive Data Filtering and Clustering Approach for Improving and Interpreting Data Clustering Identification of Cell Populations From Polychromatic Flow Cytometry Data

Free PMC article

Alexandra J Lee et al. Cytometry A.

Abstract

Computational methods for identification of cell populations from polychromatic flow cytometry data are changing the paradigm of cytometry bioinformatics. Data clustering is the most common computational approach to unsupervised identification of cell populations from multidimensional cytometry data. However, interpretation of the identified data clusters is labor-intensive, and certain types of user-defined cell populations are difficult to identify by fully automated data clustering analysis. Both issues are roadblocks to a cytometry lab adopting data clustering for routine cell population identification. We found that combining recursive data filtering and clustering with constraints converted from the user's manual gating strategy can effectively address these two issues. We named this new approach DAFi: Directed Automated Filtering and Identification of cell populations. The design of DAFi preserves the data-driven characteristics of unsupervised clustering for identifying novel cell subsets, while making the results interpretable to experimental scientists by mapping and merging the multidimensional data clusters into the user-defined two-dimensional gating hierarchy. The recursive data filtering process in DAFi helped identify small data clusters that are otherwise difficult to resolve by a single run of the data clustering method because of statistical interference from irrelevant major clusters. Our experimental results showed that the proportions of the cell populations identified by DAFi, while consistent with those from expert centralized manual gating, have smaller technical variances across samples than those from individual manual gating analysis and nonrecursive data clustering analysis. Compared with manual gating segregation, DAFi-identified cell populations avoided abrupt cut-offs at the boundaries. DAFi has been implemented for use with multiple data clustering methods, including K-means, FLOCK, FlowSOM, and the ClusterR package. For cell population identification, DAFi supports multiple options, including clustering, bisecting, slope-based gating, and reversed filtering, to meet various autogating needs from different scientific use cases. © 2018 International Society for Advancement of Cytometry.
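
The recursive filter-then-cluster idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain K-means with deterministic farthest-point seeding, axis-aligned rectangular gates in place of the hyper-polygon, and hypothetical names (`kmeans`, `centroid_in_gate`, `filter_once`, `recursive_filter`). At each level of the gating hierarchy, the retained events are re-clustered, clusters whose centroids fall inside the gate are kept, and all events belonging to those clusters are passed to the next level.

```python
from math import dist


def kmeans(points, k, iters=25):
    """Plain Lloyd's k-means with deterministic farthest-point seeding."""
    k = min(k, len(points))
    centroids = [points[0]]
    while len(centroids) < k:
        # Seed the next centroid at the event farthest from all current seeds.
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))

    def assign():
        # Label each event with the index of its nearest centroid.
        return [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]

    for _ in range(iters):
        labels = assign()
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assign()


def centroid_in_gate(centroid, gate):
    """A gate maps a marker index to an inclusive (low, high) range (an assumption)."""
    return all(low <= centroid[dim] <= high for dim, (low, high) in gate.items())


def filter_once(events, gate, k):
    """Cluster the events, then keep every event whose cluster centroid is in the gate."""
    centroids, labels = kmeans(events, k)
    kept = {c for c, ctr in enumerate(centroids) if centroid_in_gate(ctr, gate)}
    return [e for e, lab in zip(events, labels) if lab in kept]


def recursive_filter(events, gates, k):
    """Apply the gates in hierarchy order, re-clustering the survivors each time."""
    for gate in gates:
        if not events:
            break
        events = filter_once(events, gate, k)
    return events


def blob(cx, cy):
    """Synthetic, tightly packed group of 20 two-marker events near (cx, cy)."""
    return [(cx + 0.1 * (i % 5), cy + 0.1 * (i // 5)) for i in range(20)]


# Three synthetic populations; the gates select marker 0 high, then marker 1 high.
events = blob(0, 0) + blob(10, 10) + blob(10, 0)
gates = [{0: (5, 15)}, {1: (5, 15)}]
selected = recursive_filter(events, gates, k=3)
print(len(selected))
```

Because membership is decided per cluster rather than per event, events of a kept cluster that lie slightly outside the rectangle are still retained, which is one way to read the abstract's remark about avoiding abrupt cut-offs at gate boundaries.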

Keywords: autogating; cell population identification; constrained clustering; data prefiltering; recursive clustering.

Conflict of interest statement

The authors have no conflict of interest to declare.

Figures

Figure 1
Design features of DAFi. (A) Steps in the DAFi workflow. In Step 1, putative cell populations are identified by data clustering in multidimensional space, with cell events colored by population membership. In Step 2, a hyper-polygon is provided from combining 2D manual gating boundaries to identify the dataspace region of interest. Cell clusters are selected if their centroids are located within the hyper-polygon (two clusters shown, in light blue and magenta). In Step 3, all cell events associated with the centroids are selected and retained as the filtered population (in red), which is used as the input to the next iteration in Step 4. (B) An example gating hierarchy in which the DAFi framework can be used to identify both predefined (solid lines) and novel (dotted lines) cell populations, and organize them within a user-provided gating hierarchy for simplified annotation and interpretation. (C) Comparison of different ways for identification of the putative CD4+CD25+ regulatory T cells (Tregs): manual gating analysis with abrupt cut-off; a single run of K-means clustering (K = 500) applied to the whole sample; and DAFi using K-means for recursive filtering and clustering. The identified Treg cells are colored in red and the remaining cells are colored in white. (D) Challenge in identification of user-defined (red rectangle showing gating boundary) CD4+CD25+ regulatory T cells (Tregs) using a single run of data clustering analysis. Centroids of data clusters identified by applying the FlowSOM clustering method (K = 100) to the whole sample are highlighted as red crosses, none of which is in the CD4+CD25+ region. (E) DAFi (K-means clustering used) identification of CD4+ T, CD8+ T, CD3+CD56+ T and CD3hiCD56+ T cells. CD4+ T and CD8+ T cells are shown on CD4 vs. CD8 dot plots, while CD3+CD56+ T and CD3hiCD56+ T cells are on CD3 vs. CD56 plots. Cell populations identified by DAFi are colored in red. [Color figure can be viewed at wileyonlinelibrary.com]
Figure 2
Results of DAFi using K-means and FlowSOM in comparison with individual and centralized manual gating analysis. (A) Illustration of the manual gating hierarchy for identifying the 22 predefined cell populations from the 10-color T cell panel, with gating boundaries shown on each 2D dot plot. Along the direction of the red arrows is the sequence of the gates with their parent populations. The cell populations are numbered. Names of the cell types are listed to the right. (B) Results of DAFi using K-means for identifying the corresponding 22 predefined cell populations. Events from the whole sample are colored in white. The black colored dots are events of the parent population, with events identified by DAFi highlighted in red, yellow, green, and blue. (C) Linear regression analysis of percentages of clearly defined cell populations identified by the K-means and the FlowSOM data clustering methods with and without DAFi compared with centralized manual gating. X axis: cell populations sorted based on their average percentage, from the largest to the smallest. Y axis: P values (–log10 transformed) of x-variable in linear regression analysis between percentages of the cell populations identified by four computational methods and the centralized manual gating analysis. (D) Linear regression analysis of percentages of poorly resolved cell populations identified by the K-means and the FlowSOM data clustering methods with and without DAFi compared with centralized manual gating. (E) CV of population percentages across the 24 samples for clearly defined cell populations by six different approaches. (F) CV of population percentages across the 24 samples for poorly resolved cell populations by the six different approaches. [Color figure can be viewed at wileyonlinelibrary.com]
Figure 3
Correlation analysis of DAFi-defined cell population proportions with subject age and gender. (A) Age distribution of participants separated by gender. (B) Proportions of Naïve CD4+ T and Naïve CD8+ T cells (with CD4+ and CD8+ T cells as parents, respectively) versus age with linear regression P values reported. (C) Pearson correlation and linear regression analysis of proportions of T cell subsets with subject age. Parent population definitions of the T-cell subsets can be found in Figure 2A. P values of x-variable in linear regression analysis were –log10 transformed and multiple comparison corrected by Bonferroni correction. (D) Proportion of CD4+ T cells in female and male participants. (E) Correlation between the proportions of effector memory T cells versus Naïve T cells. [Color figure can be viewed at wileyonlinelibrary.com]
Figure 4
Quantification of human immune response to influenza and pneumococcal vaccination using DAFi. From left to right under each vaccine/saline treatment are three selected time points from one individual in each treatment group: 7 days before the treatment (Day −7), and Day 7 and Day 28 after treatment. (A) CD19+ B cells were identified by DAFi using the 2D rectangular gates in FSC/SSC-A and CD19/SSC-A plots illustrated in the first two rows. The two following rows show the B cell events (colored in blue) on IgD versus CD27 and CD20 versus CD138 dot plots. (B) Plasmablast cells identified by DAFi from the CD19+ B cell population. The plasmablasts, defined as IgD−CD27hi, are shown in the red box. (C) Percentage of plasmablast cells (with CD19+ B cell as parent) identified across time points and treatment groups by DAFi in box plots. (D) Normalized proportions of the plasmablast population (with CD19+ B cell as parent) identified by DAFi and manual gating analysis across time points and treatment groups. [Color figure can be viewed at wileyonlinelibrary.com]
Figure 5
Identification of known and novel cell-based biomarkers for LTBI using constrained FLOCK clustering of DAFi filtered populations. (A) Manual gating strategy for identifying CD4+ T cells. The gating path sequentially identifies lymphocytes (FSC-A vs. SSC-A), singlet lymphocytes based on FSC-A/W, singlet lymphocytes based on SSC-A/W, live CD8− T lymphocytes (the DUMP channel includes CD8/CD14/CD19/LiveDead), and CD3+CD4+ T lymphocytes. (B) Manual gating strategy for identifying subset populations from the CD4+ T cells, based on CD25, CCR7, CD45RA, CCR4, CCR6, and CXCR3 expression. (C) Percentages of the three cell subsets (CD4+ T cell population as parent) that have P values < 0.05 (annotated with **) after BH correction identified by the GLM with quasi binomial distribution, between LTBI and HC. (D) Percentages of the two Tetramer+ cell populations which should only be found in the samples of LTBI. (E) Two types of statistical tests were applied to identify which CD4+ T cell subsets are significantly different in abundance between LTBI and HC. The X axis shows the IDs of the cell populations with P values < 0.05 by either statistical test before BH correction. The Y axis shows the P values after BH correction. (F) 2D dot plots of the three CD4+ T cell subsets (percentages shown in part C of this Figure) that differ between LTBI and HC. (G) The two Tetramer+ cell subsets (percentages shown in part D of this Figure) with their events highlighted in red on 2D plots of different markers. Both are very rare (average < 0.1% of CD4+ T cells). (H) t-SNE map of the filtered data. CD4+ T cells are color-coded based on expression level of tetramer to highlight the tetramer+ population in the mid-upper left region. (I) Zoomed-in t-SNE map shows that the “island” of the tetramer+ population consists of two separated regions, corresponding to Pop#18 (highlighted in yellow) and Pop#65 (highlighted in blue). (J) The hierarchy of cell populations identified by the Citrus method. Cell populations that are significantly different between the LTBI and the HC groups are highlighted in red, which belong to two branches: Branch 1 (8 cell populations) and Branch 2 (2 cell populations). (K) One example sample in the HC group showing the two cell populations with the best P values generated by the Citrus method from Branch 1 and Branch 2, respectively. (L) One example sample in the LTBI group showing the two cell populations with the best P values generated by the Citrus method from Branch 1 and Branch 2, respectively. [Color figure can be viewed at wileyonlinelibrary.com]
