Fast clustering using adaptive density peak detection

Stat Methods Med Res. 2017 Dec;26(6):2800-2811. doi: 10.1177/0962280215609948. Epub 2015 Oct 16.


Common limitations of clustering methods include the slow algorithm convergence, the instability of the pre-specification on a number of intrinsic parameters, and the lack of robustness to outliers. A recent clustering approach proposed a fast search algorithm of cluster centers based on their local densities. However, the selection of the key intrinsic parameters in the algorithm was not systematically investigated. It is relatively difficult to estimate the "optimal" parameters since the original definition of the local density in the algorithm is based on a truncated counting measure. In this paper, we propose a clustering procedure with adaptive density peak detection, where the local density is estimated through the nonparametric multivariate kernel estimation. The model parameter is then able to be calculated from the equations with statistical theoretical justification. We also develop an automatic cluster centroid selection method through maximizing an average silhouette index. The advantage and flexibility of the proposed method are demonstrated through simulation studies and the analysis of a few benchmark gene expression data sets. The method only needs to perform in one single step without any iteration and thus is fast and has a great potential to apply on big data analysis. A user-friendly R package ADPclust is developed for public use.

Keywords: Clustering; automatic intrinsic parameter selection; density peak; fast computation; multivariate kernel density estimation.

MeSH terms

  • Algorithms
  • Biostatistics / methods
  • Cluster Analysis*
  • Computer Simulation
  • Databases, Genetic / statistics & numerical data
  • Gene Expression
  • Humans
  • Models, Statistical*
  • Multivariate Analysis
  • Software
  • Statistics, Nonparametric