Comparative Study

BMC Bioinformatics. 2009 Jul 10;10:213. doi: 10.1186/1471-2105-10-213.

A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data


Bjoern H Menze et al.

Abstract

Background: Regularized regression methods such as principal component or partial least squares regression perform well in learning tasks on high dimensional spectral data, but cannot explicitly eliminate irrelevant features. The random forest classifier with its associated Gini feature importance, on the other hand, allows for an explicit feature elimination, but may not be optimally adapted to spectral data due to the topology of its constituent classification trees which are based on orthogonal splits in feature space.

Results: We propose to combine the best of both approaches, and evaluated the joint use of a feature selection, based on recursive feature elimination using the Gini importance of random forests, together with regularized classification methods on spectral data sets from medical diagnostics, chemotaxonomy, biomedical analytics, food science, and synthetically modified spectral data. Here, a feature selection using the Gini feature importance, combined with a regularized classification by discriminant partial least squares regression, performed as well as or better than a filtering according to different univariate statistical tests, or a backward feature elimination using regression coefficients. It outperformed the direct application of the random forest classifier, as well as the direct application of the regularized classifiers on the full set of features.

Conclusion: The Gini importance of the random forest provided superior means for measuring feature relevance on spectral data, but - on an optimal subset of features - the regularized classifiers might be preferable over the random forest classifier, in spite of their limitation to model linear dependencies only. A feature selection based on Gini importance, however, may precede a regularized linear classification to identify this optimal subset of features, and to earn a double benefit of both dimensionality reduction and the elimination of noise from the classification task.
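The pipeline described in the abstract can be prototyped with standard tools. The following is a minimal sketch, not the authors' implementation: random-forest Gini importance drives a recursive feature elimination, and a discriminant PLS regression classifies on the retained channels. The synthetic data, the 20% elimination fraction, and the stopping size of 20 channels are illustrative assumptions.

    # Minimal sketch (not the authors' code): Gini-importance-driven recursive
    # feature elimination followed by discriminant PLS classification.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 500))             # 200 spectra, 500 channels (synthetic stand-in)
    y = (X[:, :5].sum(axis=1) > 0).astype(int)  # labels driven by a few channels

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    keep = np.arange(X.shape[1])
    while keep.size > 20:                       # stopping size is an arbitrary choice
        rf = RandomForestClassifier(n_estimators=500, random_state=0)
        rf.fit(X_tr[:, keep], y_tr)
        order = np.argsort(rf.feature_importances_)  # Gini importance, ascending
        keep = keep[order[int(0.2 * keep.size):]]    # drop the 20% least important

    # Discriminant PLS: regress the 0/1 labels and threshold the prediction at 0.5.
    pls = PLSRegression(n_components=5).fit(X_tr[:, keep], y_tr.astype(float))
    y_hat = (pls.predict(X_te[:, keep]).ravel() > 0.5).astype(int)
    print("accuracy:", (y_hat == y_te).mean(), "channels kept:", keep.size)

In the paper, each elimination step is evaluated by cross-validated accuracy (Fig. 5); the fixed stopping size here merely keeps the sketch short.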


Figures

Figure 1
Decision trees separating two classes: a classification problem with uncorrelated features (left), and a distorted version resulting from an additive noise process (right). This process induces correlation by adding a random value to both features, thus mimicking the acquisition process of many absorption, reflectance, or resonance spectra (see Methods section). Growing orthogonal decision trees on such a data set – shown on the right – results in deeply nested trees with complex decision boundaries. (Neither tree is grown to full depth, for visualization purposes.)
Figure 2
Importance measures on the NMR candida data, in the range from 0.35 to 4 ppm (upper figure), for all 1500 spectral channels (lower figure). Top: p-values of a t-test (black) and a Wilcoxon test (gray). Below: Gini importance of a random forest with 3000 trees (gray) and 6000 trees (black). Compare the ranked measures in Fig. 3.
Figure 3
Comparison of the different feature selection measures applied to the NMR candida 2 data (3A). Multivariate feature importance measures can select variables that are discarded by univariate measures (3B). Fig. 3A, from top to bottom: Gini importance, absolute values; Gini importance, ranked values; p-values from t-test, ranked values. Fig. 3B: feature importance scores (black: Gini importance, gray: t-test). Perhaps surprisingly, regions with complete overlap of the marginal distributions (3B bottom, indicated by vertical lines) are assigned importance by the multivariate measure (3B top). This is indicative of higher-order interaction effects, which can be exploited when the measure is used for feature selection with a subsequent classifier.
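As a rough illustration of how the two measures contrasted in Figs. 2-3 can be computed per channel, the sketch below compares univariate t-test p-values with the Gini importance of a random forest on synthetic stand-in data; the 500-tree forest is an assumption, smaller than the 3000- and 6000-tree forests of Fig. 2.

    # Per-channel univariate p-values versus multivariate Gini importance.
    import numpy as np
    from scipy.stats import ttest_ind
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 300))            # 100 spectra, 300 channels (synthetic)
    y = rng.integers(0, 2, size=100)
    X[y == 1, :10] += 0.8                      # class shift on the first 10 channels

    p_values = ttest_ind(X[y == 0], X[y == 1], axis=0).pvalue  # one p-value per channel
    rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
    gini = rf.feature_importances_                             # one Gini score per channel

    # Rank both measures as in Fig. 3A: low p-value and high Gini both mean "relevant".
    print("t-test picks:", np.argsort(p_values)[:10])
    print("Gini picks:  ", np.argsort(gini)[-10:])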
Figure 4
Tukey mean-difference plot of univariate and multivariate feature importance (left), and correlation of the importance measures shown in Fig. 3A (right). Horizontal lines in the left figure indicate differences of more than two sigma; the vertical line in the right figure indicates a threshold on the univariate p-value of 0.05 (with relevant features lying to the right of this line). The importances assigned by univariate and multivariate measures are generally highly correlated; many of the features marked in red (corresponding to the spectral channels indicated in Fig. 3B), however, are flagged as uninformative by the univariate measure and as relevant by the multivariate measure.
Figure 5
Classification accuracy (left column) and standard error (right column) during the course of recursive feature elimination for PLS regression (black), PC regression (dark gray) and random forest (light gray), in combination with different feature selection criteria: univariate (dotted), PLS/PC regression (dashed) and Gini importance (solid).
Figure 6
Channel-wise variance of each feature (horizontal axis) and its correlation with the dependent variable (vertical axis). For the data sets of the left and the central column, a feature selection was not required for optimal performance, while the data sets shown in the right column benefited from a feature selection. Circle diameter indicates the magnitude of the coefficient in the PLS regression. In the right column, selected features are shown as red circles, while (the original values of) eliminated features are indicated by black dots. Relevant features show both a high variance and a high correlation with the class labels.
Figure 7
The effect of different noise processes on the performance of a random forest (green triangles) and a PLS classification (red circles). In the left column, feature vectors are augmented by a random variable, which is subsequently rescaled according to a factor S (horizontal axis), thus introducing non-discriminatory variance to the classification problem. In the right column, a random variable scaled by factor S is added as constant offset to the feature vectors, increasing the correlation between features (see text for details). Shown are results on the basis of the bivariate classification problem of Fig. 1 (top row), the NMR candida 2 data (middle), and the BSE binned data (below).
Figure 8
The effect of different noise processes on the performance of the feature selection methods in the synthetic bivariate classification problem illustrated in Fig. 1. In the left column, feature vectors are extended by a random variable scaled by S; in the right column, a random offset of size S is added to the feature vectors. Top row: classification accuracy of the synthetic two-class problem (as in Fig. 7, for comparison); second row: multivariate Gini importance; bottom row: p-values of the univariate t-test. The black lines correspond to the values of the two features spanning the bivariate classification task (Fig. 1); the blue dotted line corresponds to the third feature in the synthetic data set, the random variable. The performance of the random forest remains nearly unchanged even in the presence of a strong source of "local" noise at high values of S.
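The two noise processes of Figs. 7-8 can be emulated in a few lines. This is a sketch under assumed shapes and scales, not the exact simulation of the Methods section: one variant appends a non-discriminatory random channel scaled by S, the other adds a per-spectrum random offset scaled by S, which correlates all channels as in Fig. 1.

    # Two synthetic noise processes applied to a feature matrix X.
    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, 2))              # the bivariate problem of Fig. 1
    S = 5.0                                    # noise scale (horizontal axis in Fig. 7)

    # Left column of Fig. 7: augment each feature vector with a scaled random
    # variable, introducing non-discriminatory variance.
    X_aug = np.hstack([X, S * rng.normal(size=(200, 1))])

    # Right column of Fig. 7: add a scaled random offset to all channels of a
    # spectrum, increasing the correlation between features.
    offset = S * rng.normal(size=(200, 1))
    X_off = X + offset                         # broadcasts the offset across channels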


