Prediction of cancer outcome with microarrays: a multiple random validation strategy

Lancet. 2005 Feb 5-11;365(9458):488-92. doi: 10.1016/S0140-6736(05)17866-0.


Background: General studies of microarray gene-expression profiling have been undertaken to predict cancer outcome. Knowledge of this gene-expression profile or molecular signature should improve treatment of patients by allowing treatment to be tailored to the severity of the disease. We reanalysed data from the seven largest published studies that have attempted to predict prognosis of cancer patients on the basis of DNA microarray analysis.

Methods: The standard strategy is to identify a molecular signature (ie, the subset of genes most differentially expressed in patients with different outcomes) in a training set of patients and to estimate the proportion of misclassifications with this signature on an independent validation set of patients. We extended this strategy, which relies on a single training set and a single validation set, by using multiple random training and validation sets to study the stability of both the molecular signature and the proportion of misclassifications.
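
The multiple random validation strategy can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, genes are ranked by the absolute difference in mean expression between outcome groups, patients are classified with a nearest-centroid rule, and the data are synthetic. All of these are assumptions made for illustration only.

```python
import random
from collections import Counter
from statistics import mean

def make_data(n_patients, n_genes, n_informative, shift, seed=0):
    """Synthetic expression matrix (assumption): only the first
    n_informative genes differ between outcome 0 and outcome 1."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n_patients)]
    X = [[rng.gauss(shift * y[i] if g < n_informative else 0.0, 1.0)
          for g in range(n_genes)] for i in range(n_patients)]
    return X, y

def select_signature(X, y, train, n_sig):
    """Signature = the n_sig genes with the largest absolute difference
    in mean expression between outcomes, using training patients only."""
    scores = []
    for g in range(len(X[0])):
        m0 = mean([X[i][g] for i in train if y[i] == 0])
        m1 = mean([X[i][g] for i in train if y[i] == 1])
        scores.append((abs(m0 - m1), g))
    scores.sort(reverse=True)
    return [g for _, g in scores[:n_sig]]

def centroid(X, y, train, genes, label):
    """Mean expression of each signature gene in one outcome group."""
    rows = [X[i] for i in train if y[i] == label]
    return [mean([r[g] for r in rows]) for g in genes]

def predict(x, genes, c0, c1):
    """Assign the outcome whose centroid is closer (squared distance)."""
    d0 = sum((x[g] - c) ** 2 for g, c in zip(genes, c0))
    d1 = sum((x[g] - c) ** 2 for g, c in zip(genes, c1))
    return 0 if d0 <= d1 else 1

def random_validation(X, y, n_train, n_sig, n_splits, seed=0):
    """Repeat the train/validate procedure on many random splits and
    return the misclassification proportion and signature per split."""
    rng = random.Random(seed)
    patients = list(range(len(X)))
    error_rates, signatures = [], []
    for _ in range(n_splits):
        # Redraw until both outcomes appear in the training set.
        while True:
            train = rng.sample(patients, n_train)
            if len({y[i] for i in train}) == 2:
                break
        in_train = set(train)
        valid = [i for i in patients if i not in in_train]
        genes = select_signature(X, y, train, n_sig)
        c0 = centroid(X, y, train, genes, 0)
        c1 = centroid(X, y, train, genes, 1)
        errors = sum(predict(X[i], genes, c0, c1) != y[i] for i in valid)
        error_rates.append(errors / len(valid))
        signatures.append(frozenset(genes))
    return error_rates, signatures

def selection_frequency(signatures):
    """Fraction of splits in which each gene entered the signature."""
    counts = Counter(g for s in signatures for g in s)
    return {g: c / len(signatures) for g, c in counts.items()}

if __name__ == "__main__":
    X, y = make_data(n_patients=60, n_genes=200, n_informative=5, shift=0.8)
    for n_train in (10, 20, 40):
        rates, sigs = random_validation(X, y, n_train, n_sig=5, n_splits=50)
        freq = selection_frequency(sigs)
        print(n_train, round(mean(rates), 3), round(max(freq.values()), 2))
```

In this sketch, the spread of the per-split misclassification proportions and the per-gene selection frequencies stand in for the two quantities studied: the stability of the misclassification estimate and the stability of the signature. Varying `n_train` shows how both depend on training-set size.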

Findings: The list of genes identified as predictors of prognosis was highly unstable; molecular signatures strongly depended on the selection of patients in the training sets. For all but one study, the proportion misclassified decreased as the number of patients in the training set increased. Because of inadequate validation, the studies we reanalysed reported overoptimistic results compared with those from our own analyses. Five of the seven studies did not classify patients better than chance.

Interpretation: The prognostic value of published microarray results in cancer studies should be considered with caution. We advocate the use of validation by repeated random sampling.

Publication types

  • Research Support, Non-U.S. Gov't
  • Validation Study

MeSH terms

  • Gene Expression Profiling*
  • Humans
  • Neoplasms / genetics*
  • Neoplasms / therapy
  • Oligonucleotide Array Sequence Analysis*
  • Prognosis
  • Sample Size