Effect of separate sampling on classification accuracy

Mohammad Shahrokh Esfahani; Edward R Dougherty

doi:10.1093/bioinformatics/btt662

Bioinformatics. 2014 Jan 15;30(2):242-50. doi: 10.1093/bioinformatics/btt662. Epub 2013 Nov 20.

Authors

Mohammad Shahrokh Esfahani¹, Edward R Dougherty

Affiliation

¹ Department of Electrical and Computer Engineering and Center for Bioinformatics and Genomic Systems Engineering, Texas A&M University, College Station, TX 77843, USA.

PMID: 24257187

Abstract

Motivation: Measurements are commonly taken from two phenotypes to build a classifier, where the number of data points from each class is predetermined, not random. In this 'separate sampling' scenario, the data cannot be used to estimate the class prior probabilities. Moreover, predetermined class sizes can severely degrade classifier performance, even for large samples.
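The degradation described above can be illustrated with a small simulation. This is a minimal sketch under assumptions that are not taken from the paper: a one-dimensional Gaussian model with unit variance, class 0 centered at 0 and class 1 at a hypothetical mean mu, and a plug-in linear rule that uses the sampling ratio r = n0/n in place of the unknown class-0 prior c. The function names are illustrative only.

```python
import math
import random


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def true_error(m0, m1, p0, c, mu):
    """Exact error of the plug-in rule under the assumed true model.

    Rule: label 1 when a*x > b (unit variance assumed), with
    a = m1 - m0 and b = (m1^2 - m0^2)/2 + ln(p0 / (1 - p0)).
    True model: class 0 ~ N(0, 1) with prior c, class 1 ~ N(mu, 1).
    """
    a = m1 - m0
    b = 0.5 * (m1 * m1 - m0 * m0) + math.log(p0 / (1.0 - p0))
    if a == 0.0:
        says_one = 1.0 if b < 0.0 else 0.0  # constant classifier
        return c * says_one + (1.0 - c) * (1.0 - says_one)
    t = b / a
    if a > 0.0:  # label 1 when x > t
        eps0 = 1.0 - normal_cdf(t)   # class 0 labeled 1
        eps1 = normal_cdf(t - mu)    # class 1 labeled 0
    else:        # label 1 when x < t
        eps0 = normal_cdf(t)
        eps1 = 1.0 - normal_cdf(t - mu)
    return c * eps0 + (1.0 - c) * eps1


def expected_error(r, c, mu=2.0, n=100, reps=300, seed=0):
    """Monte Carlo estimate of the expected true error under separate sampling.

    The class sizes n0 = round(r * n) and n1 = n - n0 are fixed in advance,
    so n0 / n carries no information about the true prior c.
    """
    rng = random.Random(seed)
    n0 = round(r * n)
    n1 = n - n0
    total = 0.0
    for _ in range(reps):
        m0 = sum(rng.gauss(0.0, 1.0) for _ in range(n0)) / n0
        m1 = sum(rng.gauss(mu, 1.0) for _ in range(n1)) / n1
        total += true_error(m0, m1, n0 / n, c, mu)
    return total / reps


c = 0.9  # assumed population prior of class 0
matched = expected_error(r=0.9, c=c)   # sampling ratio equals the prior
balanced = expected_error(r=0.5, c=c)  # balanced design, mismatched prior
print(f"matched r=0.9: {matched:.3f}   balanced r=0.5: {balanced:.3f}")
```

In this toy setting the balanced design gives a noticeably larger expected error than the design whose ratio matches the prior. The gap comes from the mismatch between r and c rather than from sample noise, so it would not be expected to vanish as n grows.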

Results: We employ simulations using both synthetic and real data to show the detrimental effect of separate sampling on a variety of classification rules. We establish propositions on how the expected classifier error is affected when the sampling ratio differs from the population class ratio. From these we derive a sample-based minimax sampling ratio and provide an algorithm for approximating it from the data. We also extend to arbitrary distributions the classical population-based Anderson linear discriminant analysis minimax sampling ratio, which is derived from the discriminant form of the Bayes classifier.
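As a rough illustration of the minimax idea, the sketch below searches a grid for the sampling ratio that minimizes the worst-case error over all possible class priors. It is not the authors' mmratio algorithm and is not sample-based: it works at the population level in an assumed toy model (class 0 ~ N(0, s0^2), class 1 ~ N(mu, s1^2), a linear rule built from the true means and a pooled variance), and it reads "minimax" as minimizing the maximum over the prior c, which is an interpretation of the abstract. All function and parameter names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def class_errors(r, mu, s0, s1):
    """Population class-conditional errors of a linear rule designed at ratio r.

    Assumed toy model: class 0 ~ N(0, s0^2), class 1 ~ N(mu, s1^2), mu > 0.
    The rule uses the true means, the pooled variance r*s0^2 + (1-r)*s1^2,
    and r in place of the class-0 prior; it labels 1 when x > t.
    """
    pooled = r * s0 ** 2 + (1.0 - r) * s1 ** 2
    t = mu / 2.0 + pooled * math.log(r / (1.0 - r)) / mu
    eps0 = 1.0 - normal_cdf(t / s0)      # class 0 labeled 1
    eps1 = normal_cdf((t - mu) / s1)     # class 1 labeled 0
    return eps0, eps1


def minimax_ratio(mu, s0, s1, grid=999):
    """Grid search for the r in (0, 1) minimizing the worst-case error.

    For prior c the error is c*eps0 + (1-c)*eps1, which is linear in c,
    so its maximum over c in [0, 1] is max(eps0, eps1).
    """
    best_r, best_worst = None, float("inf")
    for k in range(1, grid + 1):
        r = k / (grid + 1)
        worst = max(class_errors(r, mu, s0, s1))
        if worst < best_worst:
            best_r, best_worst = r, worst
    return best_r, best_worst


r_equal, worst_equal = minimax_ratio(mu=2.0, s0=1.0, s1=1.0)
r_unequal, worst_unequal = minimax_ratio(mu=2.0, s0=1.0, s1=2.0)
print(f"equal variances:   r = {r_equal:.3f}, worst-case error = {worst_equal:.3f}")
print(f"unequal variances: r = {r_unequal:.3f}, worst-case error = {worst_unequal:.3f}")
```

In this toy model the selected ratio is the one at which the two class-conditional errors are approximately equal. With equal variances that is the balanced design, and with unequal variances it shifts away from one half.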

Availability: All code for the synthetic and real data examples is written in MATLAB, including a function called mmratio, whose output is an approximation of the minimax sampling ratio of a given dataset. The code is available at: http://gsp.tamu.edu/Publications/supplementary/shahrokh13b.

MeSH terms

Algorithms*
Bayes Theorem
Breast Neoplasms / classification*
Breast Neoplasms / genetics
Child
Discriminant Analysis
Female
Gene Expression Profiling
Humans
Leukemia, Myeloid, Acute / classification*
Leukemia, Myeloid, Acute / genetics
Multiple Myeloma / classification*
Multiple Myeloma / genetics
Precursor Cell Lymphoblastic Leukemia-Lymphoma / classification*
Precursor Cell Lymphoblastic Leukemia-Lymphoma / genetics
Sample Size
Selection Bias*