A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data
- PMID: 19591666
- PMCID: PMC2724423
- DOI: 10.1186/1471-2105-10-213
Abstract
Background: Regularized regression methods such as principal component or partial least squares regression perform well in learning tasks on high dimensional spectral data, but cannot explicitly eliminate irrelevant features. The random forest classifier with its associated Gini feature importance, on the other hand, allows for an explicit feature elimination, but may not be optimally adapted to spectral data due to the topology of its constituent classification trees which are based on orthogonal splits in feature space.
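As an illustration of how such a Gini feature importance can be obtained in practice, the following is a minimal sketch assuming a scikit-learn random forest and placeholder spectral data X with class labels y; it is not the authors' original implementation.

```python
# Sketch: Gini (mean decrease in impurity) importance of a random forest
# on a spectral data matrix. X and y are placeholder names for an
# (n_samples, n_channels) spectral matrix and its class labels.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))       # placeholder spectra: 60 samples, 200 channels
y = rng.integers(0, 2, size=60)      # placeholder binary class labels

forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X, y)

# feature_importances_ holds the impurity-based (Gini) importance per channel
gini_importance = forest.feature_importances_
ranked_channels = np.argsort(gini_importance)[::-1]
print(ranked_channels[:10])          # ten most relevant spectral channels
```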
Results: We propose to combine the best of both approaches, and we evaluated the joint use of feature selection, based on recursive feature elimination driven by the Gini importance of random forests, together with regularized classification methods on spectral data sets from medical diagnostics, chemotaxonomy, biomedical analytics, food science, and synthetically modified spectral data. Feature selection by Gini importance combined with regularized classification by discriminant partial least squares regression performed as well as or better than filtering by different univariate statistical tests or a backward feature elimination based on regression coefficients. It also outperformed the direct application of the random forest classifier, and the direct application of the regularized classifiers, to the full set of features.
Conclusion: The Gini importance of the random forest provided a superior means of measuring feature relevance on spectral data, but, on an optimal subset of features, the regularized classifiers may be preferable to the random forest classifier despite being limited to modelling linear dependencies. A feature selection based on Gini importance can therefore precede a regularized linear classification to identify this optimal subset of features, earning the double benefit of dimensionality reduction and elimination of noise from the classification task.
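The combined strategy summarized above, recursive elimination of spectral channels by Gini importance followed by a discriminant partial least squares classifier on the surviving channels, might be sketched as follows. The elimination fraction, stopping size, number of PLS components, and the data names X and y are assumptions for illustration, not the authors' exact protocol.

```python
# Sketch: recursive feature elimination by random-forest Gini importance,
# followed by discriminant PLS on the reduced spectra.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))           # placeholder spectra
y = rng.integers(0, 2, size=60)          # placeholder binary labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

keep = np.arange(X.shape[1])             # indices of surviving spectral channels
while keep.size > 20:                    # stopping size is an assumed choice
    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    forest.fit(X_train[:, keep], y_train)
    order = np.argsort(forest.feature_importances_)
    keep = keep[order[keep.size // 5:]]  # drop the least important 20% per round

# Discriminant PLS: regress the 0/1 labels on the reduced spectra and
# threshold the continuous prediction at 0.5 to obtain class assignments.
pls = PLSRegression(n_components=3)
pls.fit(X_train[:, keep], y_train.astype(float))
y_pred = (pls.predict(X_test[:, keep]).ravel() > 0.5).astype(int)
print("accuracy:", np.mean(y_pred == y_test))
```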
Similar articles
- Multivariate feature selection and hierarchical classification for infrared spectroscopy: serum-based detection of bovine spongiform encephalopathy. Anal Bioanal Chem. 2007;387(5):1801-7. doi: 10.1007/s00216-006-1070-5. PMID: 17237926
- Comparison of feature selection and classification for MALDI-MS data. BMC Genomics. 2009;10 Suppl 1:S3. doi: 10.1186/1471-2164-10-S1-S3. PMID: 19594880. Free PMC article.
- Merits of random forests emerge in evaluation of chemometric classifiers by external validation. Anal Chim Acta. 2013;801:22-33. doi: 10.1016/j.aca.2013.09.027. PMID: 24139571
- Computational advances of tumor marker selection and sample classification in cancer proteomics. Comput Struct Biotechnol J. 2020;18:2012-2025. doi: 10.1016/j.csbj.2020.07.009. PMID: 32802273. Free PMC article. Review.
- Selection of discriminant mid-infrared wavenumbers by combining a naïve Bayesian classifier and a genetic algorithm: Application to the evaluation of lignocellulosic biomass biodegradation. Math Biosci. 2017;289:153-161. doi: 10.1016/j.mbs.2017.05.002. PMID: 28511958. Review.
Cited by
- Application of machine learning algorithms to predict 30-day hospital readmission following cement augmentation for osteoporotic vertebral compression fractures. World Neurosurg X. 2024;23:100338. doi: 10.1016/j.wnsx.2024.100338. PMID: 38497061. Free PMC article.
- Towards identification of postharvest fruit quality transcriptomic markers in Malus domestica. PLoS One. 2024;19(3):e0297015. doi: 10.1371/journal.pone.0297015. PMID: 38446822. Free PMC article.
- Probing delivery of a lipid nanoparticle encapsulated self-amplifying mRNA vaccine using coherent Raman microscopy and multiphoton imaging. Sci Rep. 2024;14(1):4348. doi: 10.1038/s41598-024-54697-3. PMID: 38388635. Free PMC article.
- Can adverse childhood experiences predict chronic health conditions? Development of trauma-informed, explainable machine learning models. Front Public Health. 2024;11:1309490. doi: 10.3389/fpubh.2023.1309490. PMID: 38332940. Free PMC article.
- Uplift modeling to identify patients who require extensive catheter ablation procedures among patients with persistent atrial fibrillation. Sci Rep. 2024;14(1):2634. doi: 10.1038/s41598-024-52976-7. PMID: 38302547. Free PMC article.