Statistical interpretation of machine learning-based feature importance scores for biomarker discovery

Vân Anh Huynh-Thu; Yvan Saeys; Louis Wehenkel; Pierre Geurts

doi:10.1093/bioinformatics/bts238

Bioinformatics. 2012 Jul 1;28(13):1766-74. doi: 10.1093/bioinformatics/bts238. Epub 2012 Apr 25.

Authors

Vân Anh Huynh-Thu¹, Yvan Saeys, Louis Wehenkel, Pierre Geurts

Affiliation

¹ Department of Electrical Engineering and Computer Science, University of Liège, 4000 Liège, Belgium. vahuynh@ulg.ac.be

PMID: 22539669
DOI: 10.1093/bioinformatics/bts238

Abstract

Motivation: Univariate statistical tests are widely used for biomarker discovery in bioinformatics. These procedures are simple and fast, and their output is easily interpretable by biologists, but they can only identify variables that provide a significant amount of information in isolation from the other variables. Because biological processes are expected to involve complex interactions between variables, univariate methods may miss some informative biomarkers. Variable relevance scores provided by machine learning techniques, in contrast, can potentially highlight multivariate interacting effects, but unlike the p-values returned by univariate tests, these relevance scores are usually not statistically interpretable. This lack of interpretability hampers the determination of a relevance threshold for extracting a feature subset from the rankings and also prevents the wide adoption of these methods by practitioners.
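The limitation of univariate scoring can be illustrated with a toy case. The sketch below is not taken from the paper: it assumes two hypothetical binary variables whose class label depends only on their interaction (an XOR pattern), and it uses a simple class-mean difference as a stand-in for a univariate test statistic. Each variable shows no class difference on its own, while a rule that uses both variables classifies every sample correctly.

```python
from itertools import product

# All four combinations of two binary variables, repeated to form a balanced sample.
samples = list(product([0, 1], repeat=2)) * 25
# The class label depends only on the interaction between the two variables (XOR).
labels = [x1 ^ x2 for x1, x2 in samples]


def mean_difference(values, labels):
    """Absolute difference between the class-1 and class-0 means of one variable."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))


# Univariate view: score each variable in isolation.
univariate = [mean_difference([s[j] for s in samples], labels) for j in range(2)]

# Multivariate view: a rule that looks at both variables together.
predictions = [int(x1 != x2) for x1, x2 in samples]
joint_accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

print("univariate scores:", univariate)
print("joint accuracy:", joint_accuracy)
```

Real biomarker data are noisy and rarely this clean, so the example only shows the mechanism by which a purely univariate ranking can overlook interacting variables.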

Results: We evaluated several procedures, both existing and novel, that extract relevant features from rankings derived from machine learning approaches. These procedures replace the relevance scores with measures that can be interpreted statistically, such as p-values, false discovery rates or family-wise error rates, for which it is easier to determine a significance level. Experiments were performed on several artificial problems as well as on real microarray datasets. Although the methods differ in computing time and in the tradeoff they achieve between false positives and false negatives, some of them greatly help in the extraction of truly relevant biomarkers and should thus be of great practical interest for biologists and physicians. As a side conclusion, our experiments also clearly highlight that using model performance as a criterion for feature selection is often counter-productive.
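The abstract does not specify which procedures were evaluated, so the following is only a minimal sketch of one generic way to turn relevance scores into p-values and false discovery rates: a label-permutation null distribution followed by a Benjamini-Hochberg adjustment. It should not be read as the authors' method. All function names, the toy data and the class-mean-difference score are assumptions made for illustration; in practice `score_fn` could be any function returning one relevance score per feature, for example importances from a tree ensemble.

```python
import random


def permutation_p_values(score_fn, X, y, n_permutations=200, seed=0):
    """Estimate one p-value per feature by permuting the class labels.

    score_fn(X, y) must return a list with one relevance score per feature.
    """
    rng = random.Random(seed)
    observed = score_fn(X, y)
    exceed = [0] * len(observed)
    for _ in range(n_permutations):
        y_perm = list(y)
        rng.shuffle(y_perm)  # break any link between features and labels
        null_scores = score_fn(X, y_perm)
        for j, (null, obs) in enumerate(zip(null_scores, observed)):
            if null >= obs:
                exceed[j] += 1
    # Add-one correction so that no estimated p-value is exactly zero.
    return [(c + 1) / (n_permutations + 1) for c in exceed]


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original feature order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        j = order[rank - 1]
        running_min = min(running_min, p_values[j] * m / rank)
        adjusted[j] = running_min
    return adjusted


def mean_difference_scores(X, y):
    """Hypothetical stand-in relevance score: absolute class-mean difference."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return scores


# Toy data (assumption): 40 samples, 5 features, only feature 0 depends on the class.
data_rng = random.Random(1)
y = [0] * 20 + [1] * 20
X = [[data_rng.gauss(3.0 * label if j == 0 else 0.0, 1.0) for j in range(5)]
     for label in y]

p_values = permutation_p_values(mean_difference_scores, X, y)
fdr = benjamini_hochberg(p_values)
selected = [j for j, q in enumerate(fdr) if q <= 0.05]
print("p-values:", p_values)
print("adjusted:", fdr)
print("selected features:", selected)
```

With a statistically interpretable quantity per feature, the selection threshold becomes a significance level (0.05 here) rather than an arbitrary cutoff on a raw relevance score. The number of permutations bounds the smallest attainable p-value, so more permutations are needed when many features are tested.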

Availability and implementation: Python source code for all tested methods, as well as the MATLAB scripts used for data simulation, can be found in the Supplementary Material.

Publication types

Evaluation Study
Research Support, Non-U.S. Gov't

MeSH terms

Artificial Intelligence*
Biomarkers / analysis*
Computational Biology / methods
Data Interpretation, Statistical
Transcriptome

Substances

Biomarkers