Molecular diagnosis. Classification, model selection and performance evaluation

Methods Inf Med. 2005;44(3):438-43.

Abstract

Objectives: We discuss supervised classification techniques applied to medical diagnosis based on gene expression profiles. Our focus lies on strategies of adaptive model selection to avoid overfitting in high-dimensional spaces.

Methods: We introduce likelihood-based methods, classification trees, support vector machines and regularized binary regression. For regularization by dimension reduction, we describe feature selection methods: feature filtering, feature shrinkage and wrapper approaches. When sample sizes are small, efficient methods of data re-use are needed to assess the predictive power of a model. We discuss two issues in using cross-validation: the difference between in-loop and out-of-loop feature selection, and the estimation of model parameters in nested-loop cross-validation.
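The difference between in-loop and out-of-loop feature selection can be illustrated with a minimal sketch. It is not the authors' code: the pure-noise data, the mean-difference filter, the nearest-centroid classifier and all function names (`make_noise_data`, `select_features`, `loo_accuracy`) are assumptions chosen for illustration. Because the labels carry no signal, an honest estimate should stay near chance level, whereas selecting genes on all samples before cross-validating tends to look far better.

```python
import random

def make_noise_data(n_samples=20, n_features=500, seed=0):
    """Pure-noise 'expression' matrix with balanced, arbitrary class labels."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    return X, y

def class_means(X, y, rows, features):
    """Per-class mean of each listed feature, computed on the given rows only."""
    means = {}
    for label in (0, 1):
        members = [i for i in rows if y[i] == label]
        means[label] = [sum(X[i][j] for i in members) / len(members) for j in features]
    return means

def select_features(X, y, rows, k):
    """Filter step: keep the k features with the largest absolute mean difference."""
    all_features = range(len(X[0]))
    m = class_means(X, y, rows, all_features)
    score = [abs(a - b) for a, b in zip(m[0], m[1])]
    return sorted(all_features, key=lambda j: score[j], reverse=True)[:k]

def predict(X, y, train_rows, test_row, features):
    """Nearest-centroid rule restricted to the selected features."""
    m = class_means(X, y, train_rows, features)
    dist = {
        label: sum((X[test_row][j] - c) ** 2 for j, c in zip(features, m[label]))
        for label in (0, 1)
    }
    return 0 if dist[0] <= dist[1] else 1

def loo_accuracy(X, y, k=10, in_loop=True):
    """Leave-one-out accuracy with feature selection inside or outside the loop."""
    all_rows = list(range(len(X)))
    # Out-of-loop selection sees every label, including the held-out one.
    global_features = select_features(X, y, all_rows, k)
    correct = 0
    for test_row in all_rows:
        train_rows = [i for i in all_rows if i != test_row]
        # In-loop selection is repeated on the training samples of each fold.
        features = select_features(X, y, train_rows, k) if in_loop else global_features
        correct += predict(X, y, train_rows, test_row, features) == y[test_row]
    return correct / len(all_rows)

X, y = make_noise_data()
print("out-of-loop selection:", loo_accuracy(X, y, in_loop=False))
print("in-loop selection:    ", loo_accuracy(X, y, in_loop=True))
```

The only difference between the two runs is whether `select_features` sees the held-out sample, which is what makes the out-of-loop estimate optimistic.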

Results: Gene selection does not reduce the dimensionality of the model. Tuning parameters enable adaptive model selection. The feature selection bias is a common pitfall in performance evaluation. Model selection and performance evaluation can be combined by nested-loop cross-validation.
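How nested-loop cross-validation combines model selection with performance evaluation can be sketched as follows. This is an illustrative assumption rather than the procedure used in the paper: the k-nearest-neighbour classifier, the grid of neighbour counts, the simulated two-class data and the names `nested_cv`, `cv_accuracy` and `make_data` are all hypothetical. The inner loop picks the tuning parameter using only the outer training samples, and the outer loop scores the tuned model on samples that played no part in the tuning.

```python
import random

def make_data(n=40, p=5, shift=1.5, seed=1):
    """Simulated two-class data: class 1 is shifted by `shift` in every feature."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss(shift * label, 1.0) for _ in range(p)] for label in y]
    return X, y

def k_folds(indices, n_folds):
    """Split a list of indices into n_folds roughly equal parts."""
    return [indices[f::n_folds] for f in range(n_folds)]

def knn_predict(X, y, train, query, k):
    """Majority vote among the k nearest training samples (squared Euclidean)."""
    def distance(i):
        return sum((a - b) ** 2 for a, b in zip(X[i], X[query]))
    nearest = sorted(train, key=distance)[:k]
    votes = sum(y[i] for i in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def cv_accuracy(X, y, rows, k, n_folds):
    """Plain cross-validated accuracy of k-NN, restricted to the given rows."""
    correct = 0
    for fold in k_folds(rows, n_folds):
        held_out = set(fold)
        train = [i for i in rows if i not in held_out]
        correct += sum(knn_predict(X, y, train, q, k) == y[q] for q in fold)
    return correct / len(rows)

def nested_cv(X, y, grid=(1, 3, 5), outer_folds=5, inner_folds=4, seed=0):
    """Return the outer-loop accuracy and the parameter chosen in each outer fold."""
    rows = list(range(len(X)))
    random.Random(seed).shuffle(rows)
    correct, chosen = 0, []
    for fold in k_folds(rows, outer_folds):
        held_out = set(fold)
        train = [i for i in rows if i not in held_out]
        # Inner loop: tune k on the outer training part only.
        best_k = max(grid, key=lambda k: cv_accuracy(X, y, train, k, inner_folds))
        chosen.append(best_k)
        # Outer loop: evaluate the tuned model on the untouched outer test fold.
        correct += sum(knn_predict(X, y, train, q, best_k) == y[q] for q in fold)
    return correct / len(rows), chosen

X, y = make_data()
accuracy, chosen = nested_cv(X, y)
print("nested-CV accuracy:", accuracy)
print("k chosen per outer fold:", chosen)
```

The chosen parameter may differ between outer folds; the reported accuracy therefore describes the whole tune-then-fit procedure rather than one fixed model.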

Conclusions: Classification of microarrays is prone to overfitting. A rigorous and unbiased assessment of the predictive power of the model is essential.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Gene Expression Profiling / classification
  • Gene Expression Profiling / methods*
  • Genetic Research
  • Mathematical Computing*
  • Models, Statistical
  • Molecular Diagnostic Techniques / methods*
  • Oligonucleotide Array Sequence Analysis / classification
  • Oligonucleotide Array Sequence Analysis / methods*
  • Probability
  • Reproducibility of Results
  • Risk
  • Selection Bias