Optimized ranking and selection methods for feature selection with application in microarray experiments

J Biopharm Stat. 2010 Mar;20(2):223-39. doi: 10.1080/10543400903572720.

Abstract

In microarray experiments, the goal is often to examine many genes and select some of them for additional investigation. Traditionally, such a selection problem has been formulated as a multiple testing problem. When the genes of interest are those whose gene expression distributions differ under different conditions, multiple testing methods provide an appropriate framework for addressing the selection problem. However, when the genes of interest are the set of genes with the largest differences in gene expression under different conditions, multiple testing methods do not directly address the selection goal and can sometimes lead to biased conclusions. For such cases, we propose two methods based on the statistical ranking and selection framework that address the selection goal directly. The proposed methods are inherently optimization-based: the selection is optimized according to either a prespecified minimum correct selection ratio (r* selection) or a prespecified probability of making a correct selection (P* selection). These methods are compared with the multiple testing method that controls the tail probability of the proportion of false positives. Both simulation studies and real data applications provide insight into the fundamental difference between the multiple testing methods and the proposed methods in how they address different selection goals. The results show that the proposed methods provide a clear advantage over the multiple testing methods when the goal is to select the most significant genes (not all the significant genes). When the goal is to select all the significant genes, the proposed methods perform as well as the current multiple testing methods. Another advantage of the proposed methods is their ability to detect noisy data and thereby indicate that no sensible selection can be made.
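
To make the selection goal concrete, the following is a minimal simulation sketch, not the authors' procedure. It ranks genes by observed absolute mean difference between two groups, selects the top k, and estimates two quantities by Monte Carlo. The definitions used here are assumptions made for illustration only: the "correct selection ratio" is taken to be the fraction of selected genes that truly belong to the k genes with the largest true differences, and a "correct selection" is taken to be a run whose ratio is at least `r_star`. All function and parameter names (`simulate_top_k`, `estimate_selection_quality`, `sigma`, `n_per_group`, and so on) are hypothetical.

```python
import random


def true_top_k(true_effects, k):
    """Indices of the k genes with the largest true absolute difference."""
    order = sorted(range(len(true_effects)),
                   key=lambda i: abs(true_effects[i]), reverse=True)
    return set(order[:k])


def simulate_top_k(true_effects, k, n_per_group, sigma, rng):
    """Simulate one two-group experiment and select the k genes with the
    largest observed absolute mean difference.

    Assumption: each observed difference is normal around the true
    difference with standard error sigma * sqrt(2 / n_per_group).
    """
    se = sigma * (2.0 / n_per_group) ** 0.5
    observed = [abs(rng.gauss(delta, se)) for delta in true_effects]
    order = sorted(range(len(true_effects)),
                   key=lambda i: observed[i], reverse=True)
    return set(order[:k])


def estimate_selection_quality(true_effects, k, n_per_group, sigma,
                               r_star, n_sim=500, seed=0):
    """Monte Carlo estimates of (mean correct selection ratio,
    probability that the ratio is at least r_star)."""
    rng = random.Random(seed)
    target = true_top_k(true_effects, k)
    ratios = []
    for _ in range(n_sim):
        selected = simulate_top_k(true_effects, k, n_per_group, sigma, rng)
        ratios.append(len(selected & target) / k)
    mean_ratio = sum(ratios) / n_sim
    prob_correct = sum(r >= r_star for r in ratios) / n_sim
    return mean_ratio, prob_correct


if __name__ == "__main__":
    # Hypothetical setup: 5 genes with a true difference, 95 without.
    effects = [3.0] * 5 + [0.0] * 95
    for sigma in (0.5, 2.0, 8.0):
        print(sigma, estimate_selection_quality(effects, k=5, n_per_group=4,
                                                sigma=sigma, r_star=0.8))
```

Under these assumptions, increasing `sigma` shows how noisier data makes it harder to reach a prespecified ratio or probability for a fixed k, which is the kind of situation in which a selection may not be sensible at all.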

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Animals
  • Apolipoprotein A-I / deficiency
  • Apolipoprotein A-I / genetics
  • Computer Simulation
  • Data Interpretation, Statistical
  • Gene Expression Profiling / statistics & numerical data*
  • Gene Expression Regulation
  • Gene Expression Regulation, Leukemic
  • Humans
  • Leukemia / genetics*
  • Mice
  • Mice, Knockout
  • Models, Statistical*
  • Oligonucleotide Array Sequence Analysis / statistics & numerical data*
  • Probability
  • Reproducibility of Results

Substances

  • Apolipoprotein A-I