Overlapping probabilities of top ranking gene lists, hypergeometric distribution, and stringency of gene selection criterion

Conf Proc IEEE Eng Med Biol Soc. 2006;2006:5531-4. doi: 10.1109/IEMBS.2006.260828.


When the same set of genes appears in the top-ranking gene lists of two different studies, it is often of interest to estimate the probability that this overlap arose by chance. This overlapping probability is well known to follow the hypergeometric distribution. Usually, the lengths of the top-ranking gene lists are assumed to be fixed by a pre-set criterion, e.g., a p-value threshold for the t-test. We investigate how the overlapping probability changes with the gene selection criterion, or simply, with the length of the top-ranking gene lists. We conclude that the overlapping probability is indeed a function of the gene list length, and that its statistical significance should be quoted in the context of the gene selection criterion.
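
As an illustration of the hypergeometric calculation the abstract refers to, the following is a minimal sketch, not the authors' implementation. It assumes the standard upper-tail form: with N genes in total and two lists of lengths n1 and n2, the chance of observing at least m shared genes is the sum over k >= m of C(n1, k) C(N - n1, n2 - k) / C(N, n2). The function name `overlap_pvalue`, its parameter names, and all numbers in the demo loop are hypothetical and chosen only for illustration.

```python
from math import comb


def overlap_pvalue(total_genes, len_a, len_b, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    X is the number of genes shared by two lists of lengths len_a and
    len_b, each drawn at random from total_genes genes (an assumed null
    model; the names here are illustrative, not taken from the paper).
    """
    if not (0 <= len_a <= total_genes and 0 <= len_b <= total_genes):
        raise ValueError("list lengths must lie between 0 and total_genes")
    start = max(overlap, 0)              # overlaps below 0 are impossible
    max_overlap = min(len_a, len_b)      # overlap cannot exceed either list
    denom = comb(total_genes, len_b)
    # math.comb returns 0 when k > n, so infeasible terms vanish on their own.
    tail = sum(
        comb(len_a, k) * comb(total_genes - len_a, len_b - k)
        for k in range(start, max_overlap + 1)
    )
    return tail / denom


if __name__ == "__main__":
    # Hypothetical setting: 10000 genes, both lists of equal length, and an
    # observed overlap fixed at 10% of the list length.
    for length in (50, 100, 200, 400):
        p = overlap_pvalue(10000, length, length, length // 10)
        print(f"list length {length:4d}: P(overlap >= {length // 10}) = {p:.3e}")
```

Running the loop with different list lengths shows how the same relative overlap can carry a different chance probability as the selection criterion is loosened or tightened, which is the dependence the abstract describes.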

MeSH terms

  • Algorithms
  • Cluster Analysis
  • Data Interpretation, Statistical
  • Databases, Protein
  • Gene Expression Profiling*
  • Gene Expression Regulation*
  • Humans
  • Models, Genetic
  • Models, Statistical
  • Models, Theoretical
  • Oligonucleotide Array Sequence Analysis / instrumentation
  • Oligonucleotide Array Sequence Analysis / methods*
  • Pattern Recognition, Automated
  • Probability
  • Software