Effect of separate sampling on classification accuracy

Bioinformatics. 2014 Jan 15;30(2):242-50. doi: 10.1093/bioinformatics/btt662. Epub 2013 Nov 20.

Abstract

Motivation: Measurements are commonly taken from two phenotypes to build a classifier, where the number of data points from each class is predetermined, not random. In this 'separate sampling' scenario, the data cannot be used to estimate the class prior probabilities. Moreover, predetermined class sizes can severely degrade classifier performance, even for large samples.
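To make the first point concrete, the following is a minimal sketch, not taken from the paper's MATLAB code, of why class proportions in a separately sampled dataset carry no information about the class priors. The prior `c_true`, the design ratio `r_design`, and all function names are hypothetical values chosen for illustration.

```python
import random

def random_sample_labels(n, c, rng):
    """Random (mixture) sampling: each label is class 0 with probability c."""
    return [0 if rng.random() < c else 1 for _ in range(n)]

def separate_sample_labels(n, r):
    """Separate sampling: n0 = round(r * n) is fixed by design, whatever c is."""
    n0 = round(r * n)
    return [0] * n0 + [1] * (n - n0)

def estimated_prior(labels):
    """Naive estimate of the class-0 prior: the fraction of class-0 labels."""
    return labels.count(0) / len(labels)

rng = random.Random(0)   # fixed seed so the sketch is reproducible
c_true = 0.8             # hypothetical population prior of class 0
r_design = 0.5           # hypothetical sampling ratio chosen by the experimenter
n = 10000

c_hat_random = estimated_prior(random_sample_labels(n, c_true, rng))
c_hat_separate = estimated_prior(separate_sample_labels(n, r_design))

print("random sampling estimate:  ", c_hat_random)
print("separate sampling estimate:", c_hat_separate)
```

Under random sampling the label fraction concentrates around the true prior as the sample grows, whereas under separate sampling it simply returns the design ratio, so any rule that plugs it in as a prior estimate inherits the experimenter's choice.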

Results: We employ simulations using both synthetic and real data to show the detrimental effect of separate sampling on a variety of classification rules. We establish propositions related to the effect on the expected classifier error owing to a sampling ratio different from the population class ratio. From these we derive a sample-based minimax sampling ratio and provide an algorithm for approximating it from the data. We also extend to arbitrary distributions the classical population-based Anderson linear discriminant analysis minimax sampling ratio derived from the discriminant form of the Bayes classifier.
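The idea of a minimax sampling ratio can be illustrated at the population level with a simplified sketch. It assumes two univariate Gaussian classes with known means 0 and `delta` and unit variance, and a linear discriminant that plugs the sampling ratio `r` in as the class-0 prior. This is not the paper's sample-based algorithm or the mmratio function; the model, the grid search, and all names are assumptions made for illustration.

```python
import math

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def class_errors(r, delta):
    """Class-conditional errors when r is used as the class-0 prior.

    Class 0 ~ N(0, 1), class 1 ~ N(delta, 1). A point is labeled class 0
    when x < t, with t = delta/2 + log(r / (1 - r)) / delta.
    """
    t = delta / 2.0 + math.log(r / (1.0 - r)) / delta
    eps0 = 1.0 - phi(t)      # class-0 points that fall above the threshold
    eps1 = phi(t - delta)    # class-1 points that fall below the threshold
    return eps0, eps1

def true_error(r, c, delta):
    """Overall error when the true class-0 prior is c but r was plugged in."""
    eps0, eps1 = class_errors(r, delta)
    return c * eps0 + (1.0 - c) * eps1

def worst_case_error(r, delta):
    """The error is linear in c, so its maximum over c in [0, 1] is max(eps0, eps1)."""
    return max(class_errors(r, delta))

delta = 2.0                               # hypothetical separation of the class means
grid = [i / 100 for i in range(1, 100)]   # candidate sampling ratios
r_minimax = min(grid, key=lambda r: worst_case_error(r, delta))

print("minimax ratio on the grid:", r_minimax)
print("error with r = 0.2, c = 0.8:", true_error(0.2, 0.8, delta))
print("error with r = 0.8, c = 0.8:", true_error(0.8, 0.8, delta))
```

In this equal-variance toy model the minimax ratio is the one that equalizes the two class-conditional errors, which makes the overall error independent of the unknown prior. A ratio far from the true prior, by contrast, inflates the error no matter how much data is collected.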

Availability: All code for the synthetic-data and real-data examples is written in MATLAB. A MATLAB function, mmratio, which returns an approximation of the minimax sampling ratio for a given dataset, is also provided. The code is available at: http://gsp.tamu.edu/Publications/supplementary/shahrokh13b.

MeSH terms

  • Algorithms*
  • Bayes Theorem
  • Breast Neoplasms / classification*
  • Breast Neoplasms / genetics
  • Child
  • Discriminant Analysis
  • Female
  • Gene Expression Profiling
  • Humans
  • Leukemia, Myeloid, Acute / classification*
  • Leukemia, Myeloid, Acute / genetics
  • Multiple Myeloma / classification*
  • Multiple Myeloma / genetics
  • Precursor Cell Lymphoblastic Leukemia-Lymphoma / classification*
  • Precursor Cell Lymphoblastic Leukemia-Lymphoma / genetics
  • Sample Size
  • Selection Bias*