Bioinformatics. 2019 Apr 15;35(8):1358-1365.
doi: 10.1093/bioinformatics/bty788.

STatistical Inference Relief (STIR) Feature Selection

Free PMC article

Trang T Le et al. Bioinformatics.

Abstract

Motivation: Relief is a family of machine learning algorithms that uses nearest-neighbors to select features whose association with an outcome may be due to epistasis or statistical interactions with other features in high-dimensional data. Relief-based estimators are non-parametric in the statistical sense that they do not have a parameterized model with an underlying probability distribution for the estimator, making it difficult to determine the statistical significance of Relief-based attribute estimates. Thus, a statistical inferential formalism is needed to avoid imposing arbitrary thresholds to select the most important features. We reconceptualize the Relief-based feature selection algorithm to create a new family of STatistical Inference Relief (STIR) estimators that retains the ability to identify interactions while incorporating sample variance of the nearest neighbor distances into the attribute importance estimation. This variance permits the calculation of statistical significance of features and adjustment for multiple testing of Relief-based scores. Specifically, we develop a pseudo t-test version of Relief-based algorithms for case-control data.
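The idea of folding the sample variance of neighbor differences into the importance score can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the pseudo t-statistic takes the form of a pooled-variance two-sample t comparing an attribute's miss differences with its hit differences, and it uses a fixed number k of nearest hits and misses with Manhattan distance. The function names `hit_miss_pairs` and `pseudo_t` and the toy data are hypothetical.

```python
import math
import random
from statistics import mean, variance


def hit_miss_pairs(X, y, k):
    """Collect (i, j) index pairs: k nearest hits and k nearest misses per instance."""
    hits, misses = [], []
    n = len(X)
    for i in range(n):
        # Manhattan distance from instance i to every other instance, nearest first
        ranked = sorted(
            (sum(abs(u - v) for u, v in zip(X[i], X[j])), j)
            for j in range(n) if j != i
        )
        same = [j for _, j in ranked if y[j] == y[i]][:k]   # same class: hits
        other = [j for _, j in ranked if y[j] != y[i]][:k]  # other class: misses
        hits.extend((i, j) for j in same)
        misses.extend((i, j) for j in other)
    return hits, misses


def pseudo_t(X, hits, misses, a):
    """Assumed form: pooled two-sample t on miss vs hit differences of attribute a."""
    miss_d = [abs(X[i][a] - X[j][a]) for i, j in misses]
    hit_d = [abs(X[i][a] - X[j][a]) for i, j in hits]
    n_m, n_h = len(miss_d), len(hit_d)
    pooled_var = ((n_m - 1) * variance(miss_d)
                  + (n_h - 1) * variance(hit_d)) / (n_m + n_h - 2)
    return (mean(miss_d) - mean(hit_d)) / math.sqrt(pooled_var * (1 / n_m + 1 / n_h))


# Hypothetical toy case-control data: attribute 0 has a main effect, the rest are noise.
random.seed(0)
y = [0] * 20 + [1] * 20
X = [[random.gauss(3.0 * label if a == 0 else 0.0, 1.0) for a in range(6)]
     for label in y]

hits, misses = hit_miss_pairs(X, y, k=5)
scores = [pseudo_t(X, hits, misses, a) for a in range(6)]
print(scores)
```

A large positive score means that neighbors from the opposite class differ more on that attribute than neighbors from the same class, relative to the spread of those differences. The toy data contain only a main effect; they do not exercise the interaction case that motivates Relief.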

Results: We demonstrate the statistical power and control of type I error of the STIR family of feature selection methods on a panel of simulated data that exhibits properties reflected in real gene expression data, including main effects and network interaction effects. We compare the performance of STIR when the adaptive radius method is used as the nearest neighbor constructor with STIR when the fixed-k nearest neighbor constructor is used. We apply STIR to real RNA-Seq data from a study of major depressive disorder and discuss STIR's straightforward extension to genome-wide association studies.
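The two neighbor constructors compared above differ in how many neighbors each instance receives. The sketch below contrasts them on a single row of distances. It rests on an assumption the abstract does not state: that the adaptive radius for instance i is the mean of its distances to all other instances minus half their standard deviation, which is our understanding of the multiSURF rule. The function names and the `alpha` parameter are hypothetical, and class labels are omitted for brevity (a Relief-style method would go on to separate neighbors into hits and misses, and ReliefF usually picks k of each).

```python
from statistics import mean, stdev


def fixed_k_neighbors(dist_row, i, k):
    """Indices of the k instances closest to instance i."""
    ranked = sorted((d, j) for j, d in enumerate(dist_row) if j != i)
    return [j for _, j in ranked[:k]]


def adaptive_radius_neighbors(dist_row, i, alpha=0.5):
    """Indices of instances inside an instance-specific radius (assumed rule)."""
    others = [d for j, d in enumerate(dist_row) if j != i]
    radius = mean(others) - alpha * stdev(others)  # assumption: mean - sd/2
    return [j for j, d in enumerate(dist_row) if j != i and d < radius]


# Distances from instance 0 to instances 0..4 (hypothetical values)
dist_row = [0.0, 1.0, 2.0, 3.0, 10.0]
fixed = fixed_k_neighbors(dist_row, i=0, k=2)
adaptive = adaptive_radius_neighbors(dist_row, i=0)
print(fixed, adaptive)  # [1, 2] [1]
```

With fixed k every instance contributes the same number of neighbors, whereas the radius rule lets that number vary with the local spread of distances.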

Availability and implementation: Code and data available at http://insilico.utulsa.edu/software/STIR.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
Comparison of the pseudo-code of the original ReliefF algorithm as implemented in ReBATE (Urbanowicz et al., 2018a) (Algorithm 1, left) versus the reformulated version of ReliefF (Algorithm 2, right, based on Eq. 7 – line 13). The reformulated version allows for algorithm optimization by precomputing miss and hit matrices (Algorithm 2, line 7 – Section 2.1.4) and using a vectorized diff function (Algorithm 2, lines 11 and 12). The sums in line 13 are over all elements of Ha and Ma (all pairs of neighbors for all instances). The pseudo-code for STIR (Eq. 10) works similarly.
Fig. 2.
STIR versus permutation-test multiSURF and univariate t-test. Comparison of the performance (True Negative Rate, Precision, and Recall) of STIR (with multiSURF neighborhood, mauve), permutation test of multiSURF (blue), and univariate t-test (green) to detect functional attributes. Each method determines positives by 0.05 FDR adjusted P-value threshold. Each simulation is replicated 100 times with m = 100 samples and p = 1000 attributes with 100 functional (A) main effects (bias = 0.8) and (B) interaction network effects (sint = 0.4). (Color version of this figure is available at Bioinformatics online.)
Fig. 3.
The effect of k on the performance of STIR to detect functional attributes with main effects (A) and interaction effects (B). Comparison of the performance (True Negative Rate, Precision, and Recall) of STIR-ReliefF for multiple values of nearest neighbors k (k = 5, 16, 33, 49, gray scale) and STIR-multiSURF (adaptive radius, mauve). All methods determine positives using a 0.05 FDR adjusted P-value threshold. Each simulation is replicated 100 times with m = 100 samples and p = 1000 attributes with 100 functional attributes. (Color version of this figure is available at Bioinformatics online.)
Fig. 4.
Major depressive disorder gene scatter plot of log10 adjusted significance for STIR-multiSURF and standard t-test for RNA-Seq differential expression. STIR-multiSURF finds 32 genes that are significant at the FDR-adjusted 0.05 level (above horizontal dashed line). Standard t-test finds eight genes that are significant at the FDR-adjusted 0.05 level (to right of vertical dashed line). STIR identifies all eight significant main effects from the t-test (gray) and additional candidate genes (mauve) that may involve interactions. Due to overlap of plot points, not all significant genes are labeled. See Supplementary Figure S2 for detailed labels. (Color version of this figure is available at Bioinformatics online.)
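The figure captions select positives at a 0.05 FDR-adjusted P-value threshold. As an illustration of what such an adjustment involves, the sketch below applies the Benjamini-Hochberg step-up procedure, which we assume is the adjustment meant here; the captions do not name the procedure. The function name `bh_adjust` and the P-values are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending P-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity of the adjusted values
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw P-values for five attributes
pvals = [0.001, 0.008, 0.039, 0.041, 0.9]
adj = bh_adjust(pvals)
selected = [i for i, q in enumerate(adj) if q < 0.05]
print(selected)  # [0, 1]
```

In this example the third and fourth attributes have raw P-values below 0.05 but adjusted values just above it, so only the first two would be called positive.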


References

    1. Benjamini Y. et al. (2001) Controlling the false discovery rate in behavior genetics research. Behav. Brain Res., 125, 279–284.
    2. Greene C.S. et al. (2009) Spatially Uniform ReliefF (SURF) for computationally-efficient filtering of gene-gene interactions. BioData Min., 2, 5.
    3. Kira K., Rendell L.A. (1992) The feature selection problem: traditional methods and a new algorithm. In: Proceedings Tenth National Conference on Artificial Intelligence, AAAI Press/The MIT Press, San Francisco, CA, pp. 129–134.
    4. Kononenko I. et al. (1997) Overcoming the myopia of inductive learning algorithms with RELIEFF. Appl. Intell., 7, 39–55.
    5. Lareau C.A. et al. (2015) Differential co-expression network centrality and machine learning feature selection for identifying susceptibility hubs in networks with scale-free structure. BioData Min., 8, 5.
