An enhanced deterministic K-Means clustering algorithm for cancer subtype prediction from gene expression data

Comput Biol Med. 2017 Dec 1:91:213-221. doi: 10.1016/j.compbiomed.2017.10.014. Epub 2017 Oct 23.

Abstract

Background: Clustering algorithms with steps involving randomness usually give different results on different executions for the same dataset. This non-deterministic nature of algorithms such as the K-Means clustering algorithm limits their applicability in areas such as cancer subtype prediction using gene expression data. It is hard to sensibly compare the results of such algorithms with those of other algorithms. The non-deterministic nature of K-Means is due to its random selection of data points as initial centroids.
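The dependence on initial centroids can be illustrated with a minimal sketch. It assumes plain Lloyd-style K-Means on a toy four-point dataset; the `kmeans` function and the data are hypothetical and not taken from the paper. Two different initializations converge to two different stable partitions, which is why random initialization makes results vary between runs.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd iterations from the given initial centroids; returns labels."""
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # converged: assignments no longer change
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

# Toy data (hypothetical): corners of a 4 x 1 rectangle.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

labels_a = kmeans(points, [(0, 0), (4, 0)])  # splits left vs. right
labels_b = kmeans(points, [(0, 0), (0, 1)])  # splits bottom vs. top
print(labels_a, labels_b)
```

Both partitions are fixed points of the iteration, so the algorithm has no way to move from the poorer one to the better one; only the initialization decides which is reached.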

Method: We propose an improved, density-based version of K-Means that uses a novel, systematic method for selecting initial centroids. The key idea is to select, as initial centroids, data points that lie in dense regions and are adequately separated in feature space.
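The abstract does not give the selection procedure itself, so the following is only an illustrative sketch of the general idea (dense and well separated), not the authors' algorithm. The function `density_based_init`, the `radius` parameter, and the density-times-distance score are all assumptions made for this example.

```python
import math

def density_based_init(points, k, radius):
    """Deterministically pick k initial centroids that are dense and well separated.

    Illustrative heuristic only: `radius` and the scoring rule are assumptions,
    not the published method.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Local density: number of points (including the point itself) within `radius`.
    density = [sum(1 for d in row if d <= radius) for row in dist]
    # First centroid: the densest point; ties go to the lowest index.
    chosen = [max(range(n), key=lambda i: (density[i], -i))]
    while len(chosen) < k:
        # Score favors points that are dense AND far from every chosen centroid.
        best = max(
            (i for i in range(n) if i not in chosen),
            key=lambda i: (density[i] * min(dist[i][c] for c in chosen), -i),
        )
        chosen.append(best)
    return [points[i] for i in chosen]

# Toy data (hypothetical): two tight groups plus one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
init = density_based_init(points, k=2, radius=1.5)
print(init)
```

Because nothing in the procedure is random, repeated calls on the same data return the same initial centroids, so a K-Means run started from them gives the same clustering every time. The isolated point at (5, 5) is not selected because its low density gives it a low score.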

Results: We compared the proposed algorithm with eleven widely used single clustering algorithms and a prominent ensemble clustering algorithm used for cancer data classification, evaluating performance on ten cancer gene expression datasets. The proposed algorithm showed better overall performance than the others.

Conclusion: There is a pressing need in the biomedical domain for simple, easy-to-use and more accurate machine learning tools for cancer subtype prediction. The proposed algorithm is simple, easy to use and gives stable results. Moreover, it provides comparatively better predictions of cancer subtypes from gene expression data.

Keywords: Cancer subtype prediction; Centroid initialization; Clustering; Density based; Gene expression data; K-Means.

MeSH terms

  • Algorithms*
  • Cluster Analysis
  • Databases, Genetic
  • Gene Expression Profiling / methods*
  • Humans
  • Neoplasms / genetics*
  • Neoplasms / metabolism*