Current Projection Methods-Induced Biases at Subgroup Detection for Machine-Learning Based Data-Analysis of Biomedical Data

Int J Mol Sci. 2019 Dec 20;21(1):79. doi: 10.3390/ijms21010079.

Abstract

Advances in flow cytometry enable the acquisition of large, high-dimensional data sets per patient. Novel computational techniques allow the visualization of structures in these data and, ultimately, the identification of relevant subgroups. A correct visualization requires that the projection from the high-dimensional space onto the visualization plane represent the structures in the data correctly. This work shows that frequently used techniques are unreliable in this respect. One of the most important methods for data projection in this area is t-distributed stochastic neighbor embedding (t-SNE). We analyzed its performance on artificial and real biomedical data sets. t-SNE introduced a cluster structure for homogeneously distributed data that did not contain any subgroup structure. In other data sets, t-SNE occasionally suggested the wrong number of subgroups or projected data points belonging to different subgroups as if they belonged to the same subgroup. As an alternative approach, emergent self-organizing maps (ESOM) were used in combination with U-matrix methods. This approach correctly identified homogeneous data and, in data sets containing distance- or density-based subgroup structures, correctly displayed the number of subgroups and the assignment of data points to them. The results highlight possible pitfalls in the use of a currently widely applied algorithmic technique for the detection of subgroups in high-dimensional cytometric data and suggest a robust alternative.
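To make the U-matrix idea mentioned above more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that the U-height of a neuron is the mean Euclidean distance between its prototype vector and those of its direct grid neighbors (a 4-neighborhood without wrap-around, both simplifications). The helper names `u_matrix` and `make_grid` are hypothetical, and the prototype grid is written by hand rather than trained by an ESOM.

```python
import math


def u_matrix(prototypes):
    """Return U-heights for a rectangular grid of prototype vectors.

    prototypes[r][c] is the prototype vector of the neuron at row r, column c.
    Assumption: the U-height is the mean Euclidean distance to the direct
    grid neighbors (up, down, left, right), with no wrap-around at the edges.
    """
    rows, cols = len(prototypes), len(prototypes[0])
    heights = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(math.dist(prototypes[r][c], prototypes[nr][nc]))
            heights[r][c] = sum(dists) / len(dists)
    return heights


def make_grid(two_groups):
    """Build a hypothetical 3 x 6 grid of 2-D prototype vectors.

    With two_groups=True the right half of the grid is shifted far away
    from the left half, mimicking two distance-based subgroups.
    With two_groups=False the prototypes vary smoothly (homogeneous case).
    """
    grid = []
    for r in range(3):
        row = []
        for c in range(6):
            offset = 10.0 if (two_groups and c >= 3) else 0.0
            row.append((offset + 0.1 * c, offset + 0.1 * r))
        grid.append(row)
    return grid


split = u_matrix(make_grid(two_groups=True))   # ridge expected between columns 2 and 3
flat = u_matrix(make_grid(two_groups=False))   # no ridge expected

for name, heights in (("two groups", split), ("homogeneous", flat)):
    print(name)
    for row in heights:
        print("  " + " ".join(f"{h:6.2f}" for h in row))
```

In this toy setting, the grid with two separated groups shows a band of large U-heights along the boundary between them, whereas the homogeneous grid stays uniformly low. This only illustrates how a U-matrix can indicate both the presence and the absence of subgroup structure; it says nothing about how the ESOM itself is trained.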

Keywords: computational techniques; data science; emergent self-organizing maps; flow cytometry; high-dimensional data sets; immunological research; machine-learning; t-distributed stochastic neighbor embedding.

MeSH terms

  • Algorithms
  • Antigens, CD / analysis
  • Computational Biology / methods*
  • Datasets as Topic
  • Flow Cytometry / methods*
  • Humans
  • Machine Learning*
  • Stochastic Processes

Substances

  • Antigens, CD