Pitfalls of supervised feature selection

Pawel Smialowski et al. Bioinformatics. 2010 Feb 1;26(3):440-3. doi: 10.1093/bioinformatics/btp621. Epub 2009 Oct 29.

No abstract available.

Figures

Fig. 1.
The correct [(A) followed by (C)] and the incorrect [(B) followed by (C)] procedure for combining supervised feature selection and learning a classifier. In the figure, processes and products are depicted by ellipses and rectangles, respectively. Training and test sets consist of features X and a target attribute Y (to be predicted). X is a subset of features reduced by supervised feature selection, f() is a classifier and Ŷ contains the prediction of Y values by this function. (A and B) show the workflows for the correct and incorrect application of supervised feature selection, and (C) holds the evaluation workflow (more description in the text).
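The pitfall contrasted in the two workflows can be reproduced with a minimal simulation: if a feature is selected using the whole dataset (including the test set), even purely random features yield apparently above-chance test accuracy, whereas selecting on the training set alone gives the honest chance-level estimate. The sketch below is illustrative only and is not the authors' experimental code; the toy "classifier" (a single binary feature mapped to the label by training-set majority) and all parameter values are assumptions for demonstration.

```python
import random

def make_data(n, d, rng):
    """n samples of d random binary features with random binary labels."""
    X = [[rng.randint(0, 1) for _ in range(d)] for _ in range(n)]
    y = [rng.randint(0, 1) for _ in range(n)]
    return X, y

def select_feature(X, y):
    """Supervised selection: feature whose values (dis)agree most with y."""
    n, d = len(X), len(X[0])
    best, best_score = 0, -1.0
    for j in range(d):
        agree = sum(1 for i in range(n) if X[i][j] == y[i])
        score = abs(agree / n - 0.5)
        if score > best_score:
            best, best_score = j, score
    return best

def accuracy(X_tr, y_tr, X_te, y_te, j):
    """Toy classifier: map feature j to the label by training-set majority."""
    agree = sum(1 for i in range(len(X_tr)) if X_tr[i][j] == y_tr[i])
    flip = agree < len(X_tr) / 2
    correct = sum(1 for i in range(len(X_te))
                  if (X_te[i][j] != y_te[i]) == flip)
    return correct / len(X_te)

def experiment(repeats=200, n=40, d=500, seed=0):
    rng = random.Random(seed)
    wrong_acc = right_acc = 0.0
    for _ in range(repeats):
        X, y = make_data(n, d, rng)
        h = n // 2
        X_tr, y_tr, X_te, y_te = X[:h], y[:h], X[h:], y[h:]
        # Incorrect (Fig. 1B): select using ALL data, test set included.
        j_wrong = select_feature(X, y)
        wrong_acc += accuracy(X_tr, y_tr, X_te, y_te, j_wrong)
        # Correct (Fig. 1A): select using the training set only.
        j_right = select_feature(X_tr, y_tr)
        right_acc += accuracy(X_tr, y_tr, X_te, y_te, j_right)
    return wrong_acc / repeats, right_acc / repeats
```

Because the data carry no signal, any gap between the two estimates is pure selection bias: the "incorrect" workflow reports markedly above-chance accuracy, while the "correct" one stays near 0.5.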
Fig. 2.
Relation between the number of instances and the extent of overfitting caused by feature selection as measured by AUROC growth. (A) Randomly generated attribute values, (B) randomly tagged real data. Three different feature selection algorithms were used: Wrapper (hashed bars), Relief Attribute Evaluation (white bars), PCA (black bars). Whiskers mark 95% confidence intervals.
Fig. 3.
Relation between information loss and overfitting measured by ΔAUROC growth. (A) Randomly generated attribute values, (B) randomly tagged real data. Three feature selection methods were examined: Wrapper (black circles), Relief Attribute Evaluation (open squares) and PCA (gray triangles). Lines were fitted by linear regression: solid lines to Wrapper and dashed to Relief Attribute Evaluation data points. ID ratio is the information density ratio.


