Multiple-testing strategy for analyzing cDNA array data on gene expression

Biometrics. 2004 Sep;60(3):774-82. doi: 10.1111/j.0006-341X.2004.00228.x.

Abstract

An objective of many functional genomics studies is to estimate treatment-induced changes in gene expression. cDNA arrays interrogate each tissue sample for the levels of mRNA for hundreds to tens of thousands of genes, and the use of this technology leads to a multitude of treatment contrasts. By-gene hypothesis tests evaluate the evidence supporting no effect, but selecting a significance level requires dealing with the multitude of comparisons. The p-values from these tests order the genes such that a p-value cutoff divides the genes into two sets. Ideally one set would contain the affected genes and the other would contain the unaffected genes. However, the set of genes selected as affected will contain false positives, i.e., genes that are not affected by treatment. Likewise, the other set of genes, selected as unaffected, will contain false negatives, i.e., genes that are affected. A plot of the observed p-values (1 - p) versus their expectation under a uniform [0, 1] distribution allows one to estimate the number of true null hypotheses. With this estimate, the false positive rates and false negative rates associated with any p-value cutoff can be estimated. When computed for a range of cutoffs, these rates summarize the ability of the study to resolve effects. In our work, we are more interested in selecting most of the affected genes than in protecting against a few false positives. An optimum cutoff, i.e., the best set given the data, depends upon the relative cost of falsely classifying a gene as affected versus the cost of falsely classifying a gene as unaffected. We select the cutoff by a decision-theoretic method analogous to methods developed for receiver operating characteristic curves. In addition, we estimate the false discovery rate and the false nondiscovery rate associated with any cutoff value. Two functional genomics studies that were designed to assess a treatment effect are used to illustrate how the methods allowed the investigators to determine a cutoff to suit their research goals.
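
The quantities described above can be illustrated with a minimal Python sketch. It is not the authors' procedure: in place of the graphical estimate from the (1 - p) plot it uses a simple threshold rule (the number of p-values above a threshold lam, divided by 1 - lam), which rests on the same assumption that p-values of unaffected genes are uniform on [0, 1]. The function names, the threshold lam = 0.5, and the linear cost function with weights cost_fp and cost_fn are all assumptions made for illustration.

```python
from typing import Dict, Sequence


def estimate_true_nulls(pvalues: Sequence[float], lam: float = 0.5) -> float:
    """Estimate m0, the number of true null hypotheses.

    Assumption: p-values above `lam` come almost entirely from unaffected
    genes, so about m0 * (1 - lam) of them should exceed `lam`.
    """
    above = sum(1 for p in pvalues if p > lam)
    return min(float(len(pvalues)), above / (1.0 - lam))


def error_rates(pvalues: Sequence[float], cutoff: float, m0: float) -> Dict[str, float]:
    """Estimate error counts and rates when genes with p <= cutoff are selected."""
    m = len(pvalues)
    m1 = m - m0                                   # estimated number of affected genes
    selected = sum(1 for p in pvalues if p <= cutoff)
    not_selected = m - selected
    # Under uniformity, about m0 * cutoff unaffected genes fall below the cutoff.
    false_pos = min(m0 * cutoff, float(selected))
    true_pos = min(selected - false_pos, m1)
    false_neg = m1 - true_pos
    return {
        "false_pos": false_pos,
        "false_neg": false_neg,
        "false_positive_rate": false_pos / m0 if m0 > 0 else 0.0,
        "false_negative_rate": false_neg / m1 if m1 > 0 else 0.0,
        "false_discovery_rate": false_pos / selected if selected else 0.0,
        "false_nondiscovery_rate": false_neg / not_selected if not_selected else 0.0,
    }


def choose_cutoff(pvalues: Sequence[float], m0: float,
                  cost_fp: float = 1.0, cost_fn: float = 1.0) -> float:
    """Pick the observed p-value that minimizes an assumed linear cost.

    cost = cost_fp * (estimated false positives)
         + cost_fn * (estimated false negatives)
    """
    def expected_cost(cutoff: float) -> float:
        rates = error_rates(pvalues, cutoff, m0)
        return cost_fp * rates["false_pos"] + cost_fn * rates["false_neg"]

    return min(sorted(set(pvalues)), key=expected_cost)
```

Raising cost_fn relative to cost_fp expresses the preference stated in the abstract for recovering most of the affected genes at the price of some false positives; under this sketch it tends to push the chosen cutoff upward.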

MeSH terms

  • Amphetamine / pharmacology
  • Animals
  • Biometry
  • Brain / drug effects
  • Brain / metabolism
  • Data Interpretation, Statistical
  • Gene Expression / drug effects
  • Gene Expression Profiling / statistics & numerical data*
  • Genomics / statistics & numerical data
  • Models, Statistical
  • Oligonucleotide Array Sequence Analysis / statistics & numerical data*
  • ROC Curve
  • Rats

Substances

  • Amphetamine