J Biomed Inform. 2018 Sep;85:168-188.
doi: 10.1016/j.jbi.2018.07.015. Epub 2018 Jul 17.

Benchmarking Relief-Based Feature Selection Methods for Bioinformatics Data Mining

Free PMC article

Ryan J Urbanowicz et al. J Biomed Inform.

Abstract

Modern biomedical data mining requires feature selection methods that can (1) be applied to large-scale feature spaces (e.g. 'omics' data), (2) function in noisy problems, (3) detect complex patterns of association (e.g. gene-gene interactions), (4) be flexibly adapted to various problem domains and data types (e.g. genetic variants, gene expression, and clinical data), and (5) be computationally tractable. To that end, this work examines a set of filter-style feature selection algorithms inspired by the 'Relief' algorithm, i.e. Relief-Based algorithms (RBAs). We implement and expand these RBAs in an open-source framework called ReBATE (Relief-Based Algorithm Training Environment). We conduct a comprehensive genetic simulation study comparing existing RBAs, a proposed RBA called MultiSURF, and other established feature selection methods over a variety of problems. The results of this study (1) support the assertion that RBAs are particularly flexible, efficient, and powerful feature selection methods that differentiate relevant features having univariate, multivariate, epistatic, or heterogeneous associations, (2) confirm the efficacy of expansions for classification vs. regression, discrete vs. continuous features, missing data, multiple classes, or class imbalance, (3) identify previously unknown limitations of specific RBAs, and (4) suggest that while MultiSURF* performs best at explicitly identifying pure 2-way interactions, MultiSURF yields the most reliable feature selection performance across a wide range of problem types.
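The core scoring loop that all RBAs build on can be sketched roughly as follows. This is a minimal, illustrative version of the original Relief weight update for a binary-class dataset, not the ReBATE implementation; the function name, parameters, and Manhattan-distance choice are assumptions made here for clarity.

```python
import numpy as np

def relief_scores(X, y, n_iter=100, rng=None):
    """Illustrative sketch of the original Relief scoring loop.

    For each randomly sampled target instance, find its nearest 'hit'
    (same class) and nearest 'miss' (opposite class), then reward
    features that differ on the miss and penalize those that differ
    on the hit.
    """
    rng = np.random.default_rng(rng)
    n, d = X.shape
    # Scale each feature's contribution so all features are comparable.
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    w = np.zeros(d)
    for _ in range(n_iter):
        i = rng.integers(n)
        dist = np.abs(X - X[i]).sum(axis=1)  # Manhattan distance to target
        dist[i] = np.inf                     # never pick the target itself
        same = np.where(y == y[i])[0]
        diff = np.where(y != y[i])[0]
        hit = same[np.argmin(dist[same])]
        miss = diff[np.argmin(dist[diff])]
        w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / span
    return w / n_iter
```

Under this sketch, a feature that perfectly tracks the class receives a score near 1, while an irrelevant feature scores near 0; the RBA variants benchmarked in this paper differ mainly in how neighbors ('hits' and 'misses') are chosen.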

Keywords: Classification; Epistasis; Feature selection; Genetic heterogeneity; Regression; ReliefF.

Figures

Figure 1:
Illustration of the neighbor selection differences between Relief, ReliefF, SURF, SURF*, MultiSURF*, and MultiSURF. Differences include the number of nearest neighbors or the method for selecting 'near' or 'far' instances for feature scoring. Note that k = 3 is illustrated for ReliefF, although k = 10 is most common. These illustrations are conceptual and are not drawn to scale.
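The neighbor selection contrast in this figure can be illustrated concretely: ReliefF takes a fixed number of nearest neighbors, while SURF instead takes every instance closer than a distance threshold (the mean pairwise distance to the target). A hedged sketch, with function names and the Manhattan-distance and k = 3 choices assumed here for illustration:

```python
import numpy as np

def neighbor_sets(X, i):
    """Contrast ReliefF's fixed-k neighbors with SURF's threshold neighbors
    for target instance i (illustrative sketch, not the ReBATE code)."""
    dist = np.abs(X - X[i]).sum(axis=1)   # Manhattan distance to target i
    dist[i] = np.inf                      # exclude the target itself
    relieff_k3 = np.argsort(dist)[:3]     # ReliefF: k nearest (k = 3, as in Fig. 1)
    T = dist[np.isfinite(dist)].mean()    # SURF: mean distance to i as threshold
    surf_near = np.where(dist < T)[0]     # SURF: all instances nearer than T
    return relieff_k3, surf_near
```

MultiSURF follows the same threshold idea but adapts the cutoff per target instance, which is one of the design differences the benchmark evaluates.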
Figure 2:
This heatmap illustrates the power of different feature selection algorithms to rank all predictive features in the top-scoring 'x' percent of features in the dataset. Results are shown for the noisy 3-way epistatic interaction.
Figure 3:
Results for all core 2-way epistatic interaction datasets. Keys relevant to all plots are given on the far right. Tick marks delineating algorithm groups are provided for each sub-plot.
Figure 4:
Results for detecting single feature main effects (A) and additive main effects (B). Keys relevant to all plots are given on the far right. Tick marks delineating algorithm groups are provided for each sub-plot.
Figure 5:
Results for detecting two independent heterogeneous 2-way epistatic interactions.
Figure 6:
Results for detecting 2-way epistatic interactions with an increasing number of irrelevant features in the datasets.
Figure 7:
Results for detecting 2-way, 3-way, 4-way, and 5-way epistatic interactions based on ‘clean’ XOR models.
Figure 8:
Results for detecting the address bits of different scalings of the Multiplexer benchmark problem. Each problem is 'clean', epistatic, and heterogeneous. Note that all features in these datasets are predictive in at least one subset of the training instances; power reflects the ability to distinguish the features that are important across all training instances (address bits) from those that are important only in a given subset (register bits).
Figure 9:
Results for accommodating continuous (i.e. numerical) endpoints in datasets.
Figure 10:
Results for accommodating different 'data type' issues. Specifically, this figure examines extreme examples of class imbalance, missing data, and the combination of discrete and continuous features within the same dataset.
