Selection of patient samples and genes for outcome prediction

Proc IEEE Comput Syst Bioinform Conf. 2004:382-92. doi: 10.1109/csb.2004.1332451.

Abstract

Gene expression profiles with clinical outcome data enable monitoring of disease progression and prediction of patient survival at the molecular level. We present a new computational method for outcome prediction. Our idea is to use an informative subset of the original training samples. This subset consists only of short-term survivors, who died within a short period, and long-term survivors, who were still alive after a long follow-up time. These extreme training samples provide a clear basis for identifying genes whose expression is related to survival. To find relevant genes, we combine two feature selection methods, an entropy measure and the Wilcoxon rank sum test, so that a set of sharply discriminating features is identified. The selected training samples and genes are then integrated by a support vector machine to build a prediction model, which assigns each validation sample a survival/relapse risk score used to draw Kaplan-Meier survival curves. We apply this method to two data sets: diffuse large-B-cell lymphoma (DLBCL) and primary lung adenocarcinoma. In both cases, patients in the high and low risk groups stratified by our risk scores are clearly distinguishable. We also compare our risk scores to clinical factors, such as the International Prognostic Index score for DLBCL and tumor stage for lung adenocarcinoma. Our results indicate that gene expression profiles combined with carefully chosen learning algorithms can predict patient survival for certain diseases.
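The sample-selection step and the Kaplan-Meier estimate mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the `Patient` record, the function names, and the `short_cutoff` / `long_cutoff` thresholds are all hypothetical, and the abstract does not state which cutoffs were used. Treating a long-term survivor as a censored (still alive) patient with follow-up beyond the long cutoff is one reading of the abstract's wording.

```python
from typing import List, NamedTuple, Tuple


class Patient(NamedTuple):
    patient_id: str
    followup_years: float  # time to death, or to last follow-up if censored
    died: bool             # True if death was observed, False if censored


def select_extreme_samples(
    patients: List[Patient],
    short_cutoff: float,
    long_cutoff: float,
) -> Tuple[List[Patient], List[Patient]]:
    """Keep only short-term and long-term survivors; drop everyone else.

    short_cutoff and long_cutoff are hypothetical thresholds chosen by
    the caller; they are not values taken from the paper.
    """
    # Short-term survivors: death observed before the short cutoff.
    short_term = [
        p for p in patients if p.died and p.followup_years < short_cutoff
    ]
    # Long-term survivors: still alive with follow-up past the long cutoff.
    long_term = [
        p for p in patients if not p.died and p.followup_years > long_cutoff
    ]
    return short_term, long_term


def kaplan_meier(patients: List[Patient]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each observed death time."""
    at_risk = len(patients)
    survival = 1.0
    curve = []
    for t in sorted({p.followup_years for p in patients}):
        at_time = [p for p in patients if p.followup_years == t]
        deaths = sum(1 for p in at_time if p.died)
        if deaths:
            # Product-limit update: multiply by the fraction surviving t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set after t.
        at_risk -= len(at_time)
    return curve
```

In this sketch, patients whose outcome falls between the two cutoffs are simply excluded from training. In the workflow the abstract describes, a curve such as the one from `kaplan_meier` would be drawn separately for the validation patients in the high and low risk groups.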

MeSH terms

  • Biomarkers, Tumor / analysis*
  • Diagnosis, Computer-Assisted / methods
  • Gene Expression Profiling / methods*
  • Humans
  • Neoplasm Proteins / analysis*
  • Neoplasms / diagnosis*
  • Neoplasms / metabolism*
  • Neoplasms / mortality
  • Outcome Assessment, Health Care / methods*
  • Prognosis
  • Reproducibility of Results
  • Risk Assessment / methods
  • Risk Factors
  • Sample Size
  • Sensitivity and Specificity
  • Survival Analysis*
  • Survival Rate
  • Survivors / statistics & numerical data

Substances

  • Biomarkers, Tumor
  • Neoplasm Proteins