Toward a measure of classification complexity in gene expression signatures

Vidya Kamath; Timothy J Yeatman; Steven A Eschrich

doi:10.1109/IEMBS.2008.4650509

Annu Int Conf IEEE Eng Med Biol Soc. 2008:2008:5704-7. doi: 10.1109/IEMBS.2008.4650509.

Authors

Vidya Kamath¹, Timothy J Yeatman, Steven A Eschrich

Affiliation

¹ Biomedical Engineering program at the University of South Florida, Tampa, Florida, USA. Vidya.Kamath@moffitt.org

PMID: 19164012
DOI: 10.1109/IEMBS.2008.4650509

Abstract

Gene expression signatures identify important genes that predict a specified outcome. In several notable diseases such as leukemia and breast cancer, the results have been encouraging. In these datasets, many techniques work well when discriminating particular outcomes. However, these same methods, applied to other datasets, are unable to achieve similar levels of success. Given the small sample sizes common to these studies and the large dimensionality of the data, several key issues exist when attempting to construct reliable, reproducible gene signatures. The classifiers may not be sufficient to discriminate classes, or the data itself may not be sufficient to produce effective separation. In this paper, three simple measures of classification complexity are considered to explore a limit to the predictive accuracy that can be achieved in a dataset. Two independent gene expression datasets (lung and colorectal cancer) are considered, using three different outcomes on each dataset. Four different classifiers, using the t-test for feature selection, were tested on these datasets as a representative panel of classifiers. Our results indicate that Fisher's discriminant ratio provides a good measure of the complexity of the classification problem, with a high correlation between complexity and best classification accuracy across all problems (R² = 0.78). Specifically, predicting gender is a low-complexity problem, as indicated both by the complexity measure and by the classification results. More clinically oriented endpoints are more complex and have lower classification accuracies. Therefore, classification complexity can be used to estimate the maximum attainable accuracy for a problem, reducing the need to evaluate many different classifiers.
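
As an illustration of the complexity measure named above, the following is a minimal sketch of Fisher's discriminant ratio for a two-class problem. The abstract does not give the formula the authors used; this sketch assumes the common per-feature definition (squared difference of class means divided by the sum of class variances) and takes the maximum over features. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from statistics import mean, variance


def fisher_ratio(a, b):
    """Per-feature ratio (assumed form): (mean_a - mean_b)^2 / (var_a + var_b)."""
    denom = variance(a) + variance(b)  # sample variances; each class needs >= 2 values
    if denom == 0:
        # Zero spread in both classes: perfectly separable unless the means coincide.
        return float("inf") if mean(a) != mean(b) else 0.0
    return (mean(a) - mean(b)) ** 2 / denom


def max_fisher_ratio(samples, labels):
    """Return (largest ratio, list of per-feature ratios).

    samples: list of rows, one row of expression values per sample.
    labels:  0 or 1 for each sample.
    """
    class0 = [row for row, y in zip(samples, labels) if y == 0]
    class1 = [row for row, y in zip(samples, labels) if y == 1]
    ratios = [
        fisher_ratio([r[j] for r in class0], [r[j] for r in class1])
        for j in range(len(samples[0]))
    ]
    return max(ratios), ratios


# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
samples = [[1.0, 5.0], [2.0, 6.0], [3.0, 5.0],
           [7.0, 6.0], [8.0, 5.0], [9.0, 6.0]]
labels = [0, 0, 0, 1, 1, 1]

best, ratios = max_fisher_ratio(samples, labels)
print(best, ratios)
```

Under this definition a larger ratio indicates better-separated classes on at least one feature, which corresponds to a lower-complexity problem in the sense used in the abstract.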

Publication types

Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms
Artificial Intelligence
Biomarkers, Tumor / analysis*
Diagnosis, Computer-Assisted / methods*
Gene Expression Profiling / methods*
Humans
Neoplasm Proteins / analysis
Neoplasms / diagnosis*
Neoplasms / metabolism*
Oligonucleotide Array Sequence Analysis / methods*
Pattern Recognition, Automated / methods*
Reproducibility of Results
Sample Size
Sensitivity and Specificity
Signal Processing, Computer-Assisted

Substances

Biomarkers, Tumor
Neoplasm Proteins