Permutation importance: a corrected feature importance measure
- PMID: 20385727
- DOI: 10.1093/bioinformatics/btq134
Abstract
Motivation: In life sciences, interpretability of machine learning models is as important as their prediction accuracy. Linear models are probably the most frequently used methods for assessing feature relevance, despite their relative inflexibility. However, in recent years effective estimators of feature relevance have been derived for highly complex or non-parametric models such as support vector machines and RandomForest (RF) models. Recently, it has been observed that RF models are biased in such a way that categorical variables with a large number of categories are preferred.
Results: In this work, we introduce a heuristic for normalizing feature importance measures that can correct the feature importance bias. The method is based on repeated permutations of the outcome vector for estimating the distribution of measured importance for each variable in a non-informative setting. The P-value of the observed importance provides a corrected measure of feature importance. We apply our method to simulated data and demonstrate that (i) non-informative predictors do not receive significant P-values, (ii) informative variables can successfully be recovered among non-informative variables and (iii) P-values computed with permutation importance (PIMP) are very helpful for deciding the significance of variables, and therefore improve model interpretability. Furthermore, PIMP was used to correct RF-based importance measures for two real-world case studies. We propose an improved RF model that uses the significant variables with respect to the PIMP measure and show that its prediction accuracy is superior to that of other existing models.
Availability: R code for the method presented in this article is available at http://www.mpi-inf.mpg.de/~altmann/download/PIMP.R
Contact: altmann@mpi-inf.mpg.de, laura.tolosi@mpi-inf.mpg.de
Supplementary information: Supplementary data are available at Bioinformatics online.
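The permutation scheme summarized under Results (permute the outcome vector repeatedly, re-measure each variable's importance, and report a P-value for the observed importance) can be illustrated with a minimal Python sketch. This is not the authors' R implementation. The function names `pimp_p_values`, `abs_correlation_importance` and `make_toy_data` are hypothetical, the absolute Pearson correlation is only a simple stand-in for an RF-based importance measure, and the empirical P-value with a +1 correction is an assumed choice for how the null distribution is turned into a P-value.

```python
import random
from statistics import mean


def abs_correlation_importance(X, y):
    """Stand-in importance (assumption, not RF): |Pearson r| of each column with y."""
    y_mean = mean(y)
    y_dev = [v - y_mean for v in y]
    y_ss = sum(d * d for d in y_dev)
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        c_mean = mean(col)
        c_dev = [v - c_mean for v in col]
        c_ss = sum(d * d for d in c_dev)
        if c_ss == 0 or y_ss == 0:
            scores.append(0.0)  # constant column or constant outcome
            continue
        cov = sum(a * b for a, b in zip(c_dev, y_dev))
        scores.append(abs(cov) / (c_ss * y_ss) ** 0.5)
    return scores


def pimp_p_values(X, y, importance_fn, n_permutations=200, seed=0):
    """Empirical P-value per feature from importances under a permuted outcome."""
    rng = random.Random(seed)
    observed = importance_fn(X, y)
    exceed = [0] * len(observed)
    y_perm = list(y)
    for _ in range(n_permutations):
        rng.shuffle(y_perm)  # breaks any feature-outcome association
        null_scores = importance_fn(X, y_perm)
        for j, (null, obs) in enumerate(zip(null_scores, observed)):
            if null >= obs:
                exceed[j] += 1
    # +1 correction (assumed) keeps every P-value strictly above zero
    return [(1 + e) / (1 + n_permutations) for e in exceed]


def make_toy_data(n=200, seed=1):
    """Toy data: feature 0 drives the outcome, feature 1 is pure noise."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n)]
    y = [row[0] + rng.gauss(0, 0.5) for row in X]
    return X, y
```

Calling `pimp_p_values(*make_toy_data(), abs_correlation_importance)` returns one P-value per feature. Any importance measure with the same `(X, y) -> list of scores` shape, such as an RF importance, could be passed in place of the stand-in, at the cost of refitting the model once per permutation.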
