Estimation of start and stop numbers for cluster resolution feature selection algorithm: an empirical approach using null distribution analysis of Fisher ratios

Anal Bioanal Chem. 2017 Nov;409(28):6699-6708. doi: 10.1007/s00216-017-0628-8. Epub 2017 Sep 29.


Cluster resolution feature selection (CR-FS) is a hybrid feature selection algorithm that evaluates ranked variables via sequential backward elimination (SBE) followed by sequential forward selection (SFS). The implementation of CR-FS requires two main inputs: the start number and the stop number. The start number is the number of highly ranked variables used for the SBE stage, while the stop number is the point at which the search for additional features during the SFS stage is halted. Both parameters are known to vary with each dataset, and setting them has always relied on trial and error, which introduces subjectivity into the results obtained. Drawing inspiration from the overlapping coefficient, a measure for comparing two probability density functions, empirical equations for estimating the start and stop numbers for a dataset were developed. All of the parameters in the empirical equations are obtained from the comparison of the two probability density functions, except for a constant termed d. The equations were optimized using three real-world datasets, and the optimum range of d was determined to be 0.48 to 0.57. An implementation of CR-FS using two new datasets demonstrated the validity of this approach. With start and stop numbers calculated in this way, partial least squares discriminant analysis (PLS-DA) model prediction accuracies increased from 90% and 96% to 100% for the two datasets. Additionally, there was a twofold increase in the explained variance captured in the first two principal components.

Graphical abstract: Here, we describe how to determine the start and stop numbers for an automated feature selection routine, ensuring that you get the best model you can for your data with minimal effort.

Keywords: Chemometrics; Classification; Cluster resolution; Feature selection; Fisher ratio; Overlapping coefficient.
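
To make the two ingredients named in the title and abstract concrete, the following is a minimal sketch of how Fisher ratios, a null distribution of Fisher ratios, and an overlapping coefficient between the two distributions might be computed. It does not reproduce the paper's empirical equations or the constant d, which are not given in the abstract. Several choices here are assumptions made for illustration only: the null distribution is generated by permuting class labels, the densities are estimated with a fixed-bin histogram, and the function names (`fisher_ratios`, `null_fisher_ratios`, `overlap_coefficient`) and the synthetic data are hypothetical.

```python
import numpy as np


def fisher_ratios(X, y):
    """Per-variable Fisher ratio: between-class over within-class variance."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    grand_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        class_mean = Xc.mean(axis=0)
        between += Xc.shape[0] * (class_mean - grand_mean) ** 2
        within += ((Xc - class_mean) ** 2).sum(axis=0)
    between /= len(classes) - 1
    within /= X.shape[0] - len(classes)
    return between / within


def null_fisher_ratios(X, y, n_perm=50, seed=0):
    """Assumed null model: Fisher ratios recomputed with shuffled class labels."""
    rng = np.random.default_rng(seed)
    return np.concatenate(
        [fisher_ratios(X, rng.permutation(y)) for _ in range(n_perm)]
    )


def overlap_coefficient(a, b, bins=50):
    """Histogram estimate of the overlapping coefficient, integral of min(f, g)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    edges = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), bins + 1)
    f, _ = np.histogram(a, bins=edges, density=True)
    g, _ = np.histogram(b, bins=edges, density=True)
    width = edges[1] - edges[0]
    return float(np.minimum(f, g).sum() * width)


# Synthetic two-class data (illustrative only): 10 informative variables of 200.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 200))
y = np.repeat([0, 1], 20)
X[y == 1, :10] += 2.0

observed = fisher_ratios(X, y)
null = null_fisher_ratios(X, y)
ovl = overlap_coefficient(observed, null)
print(f"overlapping coefficient (observed vs. null): {ovl:.3f}")
```

An overlapping coefficient near 1 would indicate that the observed Fisher ratios are hard to distinguish from the null, while smaller values suggest that some variables carry class information. How such quantities map to actual start and stop numbers is defined by the paper's equations and is not attempted here.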