Partial least squares and logistic regression random-effects estimates for gene selection in supervised classification of gene expression data

J Biomed Inform. 2013 Aug;46(4):697-709. doi: 10.1016/j.jbi.2013.05.008. Epub 2013 Jun 7.

Abstract

Our main interest in supervised classification of gene expression data is to infer whether the expressions can discriminate biological characteristics of samples. With thousands of gene expressions to consider, gene selection has been advocated to decrease classification error by including only the discriminating genes. We propose to base the gene selection on partial least squares (PLS) and logistic regression random-effects (RE) estimates before the selected genes are evaluated in classification models. We compare this selection with selection based on two-sample t-statistics, the current practice, and on modified t-statistics. The results indicate that gene selection based on logistic regression RE estimates is recommended in general, while selection based on the PLS estimates is recommended when the number of samples is low. Gene selection based on the modified t-statistics performs well when the genes exhibit moderate-to-high variability with moderate group separation. Respecting the characteristics of the data is a key aspect to consider in gene selection.
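
As an illustration of the baseline the abstract calls current practice, the following is a minimal sketch of filter-based gene selection by two-sample t-statistics. It is not the authors' implementation: the Welch (unequal-variance) form of the statistic, the function names `welch_t` and `select_genes`, the cutoff `k`, and the synthetic data are all assumptions made for the example. The PLS and logistic regression RE selections proposed in the paper are not reproduced here.

```python
import math
import random
from statistics import mean, variance


def welch_t(a, b):
    """Two-sample t-statistic with unequal variances (Welch form, an assumption)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def select_genes(expr, labels, k):
    """Rank genes by |t| and return the indices of the top k (hypothetical helper).

    expr   : list of per-gene lists; expr[g][i] is gene g measured in sample i
    labels : list of 0/1 class labels, one per sample
    """
    scores = []
    for g, row in enumerate(expr):
        group0 = [x for x, y in zip(row, labels) if y == 0]
        group1 = [x for x, y in zip(row, labels) if y == 1]
        scores.append((abs(welch_t(group0, group1)), g))
    scores.sort(reverse=True)  # largest |t| first
    return [g for _, g in scores[:k]]


# Synthetic data, for illustration only: 50 genes, 20 samples,
# with genes 0-4 shifted upward in class 1.
random.seed(0)
labels = [0] * 10 + [1] * 10
expr = [
    [random.gauss(3.0 if (g < 5 and y == 1) else 0.0, 1.0) for y in labels]
    for g in range(50)
]
selected = select_genes(expr, labels, k=5)
```

The selected indices would then be passed to a classification model and evaluated on held-out samples; the filter itself uses only the per-gene statistic and ignores correlations between genes.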

Keywords: Filtering; Gene selection; Logistic regression; Partial least squares; Random effects; Supervised classification.

MeSH terms

  • Gene Expression*
  • Humans
  • Least-Squares Analysis*
  • Likelihood Functions
  • Logistic Models*
  • Lymphoma / genetics
  • Selection, Genetic*