gEM/GANN: A multivariate computational strategy for auto-characterizing relationships between cellular and clinical phenotypes and predicting disease progression time using high-dimensional flow cytometry data

Dong Ling Tong; Graham R Ball; A Graham Pockley

doi:10.1002/cyto.a.22622

Cytometry A. 2015 Jul;87(7):616-23. doi: 10.1002/cyto.a.22622. Epub 2015 Jan 8.

Authors

Dong Ling Tong¹, Graham R Ball¹, A Graham Pockley¹

Affiliation

¹ The John van Geest Cancer Research Centre, Nottingham Trent University, Nottingham, NG11 8NS, United Kingdom.

PMID: 25572884
DOI: 10.1002/cyto.a.22622

Abstract

The dramatic increase in the complexity of flow cytometric datasets requires new computational approaches that can maximize the amount of information derived and overcome the limitations of traditional gating strategies. Herein, we present a multivariate computational analysis of the flow cytometry datasets from HIV-infected individuals that were provided as part of the FlowCAP-IV Challenge, using unsupervised and supervised learning techniques. Out of 383 samples (stimulated and unstimulated), 191 samples were used as a training set (34 individuals whose disease did not progress, and 157 individuals whose disease did progress). Using the results from the training set, the participants in the Challenge were then asked to predict the condition and progression time of the remaining individuals (45 "nonprogressors" and 147 "progressors"). To achieve this, we first scaled down data resolution and then excluded doublet cells from the analysis using Expectation Maximization approaches. We then standardized all samples into histograms and used a Genetic Algorithm-Neural Network (GANN) to extract feature sets from the datasets, the reliability of which was examined using WEKA-implemented classifiers. The selected feature set resulted in a high sensitivity and specificity for the discrimination of progressors and nonprogressors in the training set (average True Positive Rate = 1.00 and average False Positive Rate = 0.033). The capacity of the feature set to predict real-time survival time was better when using data from the "unstimulated" training set (r = 0.825). The P-values and 95% confidence interval log-rank ratios between actual and predicted survival time in the test set were 0.682 and 0.9542 ± 0.24 for the unstimulated dataset, and 0.4451 and 0.9173 ± 0.23 for the stimulated dataset. Our analytic strategy has demonstrated a promising capacity to extract useful information from complex flow cytometry datasets, despite a significant imbalance and variation between the training and test sets.
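
Two steps named in the abstract can be illustrated with a minimal Python sketch: standardizing samples into histograms, and scoring a classifier by True Positive Rate and False Positive Rate. This is not the authors' implementation. The function names `to_histogram` and `tpr_fpr`, the equal-width binning, the clamping of out-of-range events, and the label coding (1 = progressor, 0 = nonprogressor) are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def to_histogram(values: Sequence[float], lo: float, hi: float,
                 n_bins: int) -> List[float]:
    """Bin one channel of one sample into n_bins equal-width bins and
    normalize to relative frequencies, so that samples with different
    event counts become comparable fixed-length vectors.
    (Hypothetical helper; bin layout is an assumption.)"""
    if hi <= lo or n_bins < 1:
        raise ValueError("need hi > lo and n_bins >= 1")
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        idx = int((v - lo) / width)
        # Clamp events outside [lo, hi) into the edge bins.
        idx = min(max(idx, 0), n_bins - 1)
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_bins


def tpr_fpr(actual: Sequence[int],
            predicted: Sequence[int]) -> Tuple[float, float]:
    """Return (True Positive Rate, False Positive Rate) for binary labels,
    with 1 = progressor and 0 = nonprogressor (assumed coding)."""
    if len(actual) != len(predicted):
        raise ValueError("label sequences must have equal length")
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


if __name__ == "__main__":
    # Toy values only; not data from the study.
    print(to_histogram([0.1, 0.2, 0.9, 1.5], lo=0.0, hi=1.0, n_bins=2))
    print(tpr_fpr([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))
```

Reporting both rates matters for a training set as imbalanced as the one described (34 nonprogressors versus 157 progressors): a classifier that labels every sample a progressor would reach a True Positive Rate of 1.00 but also a False Positive Rate of 1.00.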

Keywords: FlowCAP; cluster analysis; expectation maximization; feature identification; genetic algorithm-neural network; imbalance; multidimensional; survival time.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Cluster Analysis
Computational Biology / methods*
Disease Progression*
Electronic Data Processing / methods*
Flow Cytometry / methods*
HIV Infections / diagnosis*
Humans
Multivariate Analysis
Prognosis