Exploring correlations in gene expression microarray data for maximum predictive-minimum redundancy biomarker selection and classification

Jorge M Arevalillo; Hilario Navarro

doi:10.1016/j.compbiomed.2013.07.005

Comput Biol Med. 2013 Oct;43(10):1437-43. doi: 10.1016/j.compbiomed.2013.07.005. Epub 2013 Jul 13.

Authors

Jorge M Arevalillo¹, Hilario Navarro

Affiliation

¹ Department of Statistics, Operational Research and Numerical Analysis, Universidad Nacional de Educación a Distancia (UNED), Paseo Senda del Rey 9, 28040 Madrid, Spain. Electronic address: jmartin@ccia.uned.es.

PMID: 24034735
DOI: 10.1016/j.compbiomed.2013.07.005

Abstract

An important problem in the analysis of gene expression microarray data is the extraction of valuable genetic interactions from high-dimensional data sets that contain gene expression levels collected for a small sample of assays. Past and ongoing research has focused on biomarker selection for phenotype classification. Usually, many genes carry no useful information for classifying the outcome and should be removed from the analysis; in addition, some genes may be highly correlated with one another, which indicates redundancy in the expressed information. In this paper we propose a method for selecting highly predictive genes with low redundancy in their expression levels. The predictive accuracy of the selection is assessed by means of Classification and Regression Trees (CART) models, which measure how well the selected genes classify the outcome variable and can also uncover complex genetic interactions. The method is illustrated throughout the paper using a public domain colon cancer gene expression data set.
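The idea of keeping highly predictive genes while discarding redundant ones can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract does not specify the scoring or the selection rule, so the absolute Pearson correlation with the class label (as the relevance score), the greedy ranking, the `max_redundancy` threshold, the function names and the toy expression values are all assumptions made for illustration. The CART assessment step is omitted.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def select_genes(expr, labels, k=2, max_redundancy=0.8):
    """Greedy filter (illustrative assumption, not the paper's algorithm).

    expr   : dict mapping gene name -> expression values, one per sample
    labels : class label (0/1) per sample
    Genes are ranked by |correlation with the label|; a gene is kept only if
    its |correlation| with every already selected gene is <= max_redundancy.
    """
    relevance = {g: abs(pearson(v, labels)) for g, v in expr.items()}
    ranked = sorted(relevance, key=relevance.get, reverse=True)
    selected = []
    for gene in ranked:
        if len(selected) == k:
            break
        if all(abs(pearson(expr[gene], expr[s])) <= max_redundancy for s in selected):
            selected.append(gene)
    return selected


# Hypothetical toy data: 8 samples, 4 genes (values invented for the example).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
expr = {
    "gene_a": [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1],  # strongly predictive
    "gene_b": [2.1, 2.4, 1.9, 2.2, 6.0, 6.3, 5.9, 6.2],  # near copy of gene_a
    "gene_c": [2.0, 3.0, 1.0, 2.5, 2.0, 4.0, 3.0, 3.5],  # moderately predictive
    "gene_d": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0],  # unrelated to the label
}

selected = select_genes(expr, labels, k=2, max_redundancy=0.8)
print(selected)
```

In this toy example only one of the two nearly identical genes (`gene_a`, `gene_b`) is retained, and the second slot goes to the less correlated `gene_c`; a classifier such as a CART model could then be fitted on the retained genes to assess their predictive accuracy.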

Keywords: Biomarker selection; Classification and prediction; Classification and regression tree; Gene expression; Microarray data; Redundancy.

MeSH terms

Algorithms
Biomarkers / analysis*
Computational Biology / methods*
Data Mining
Gene Expression Profiling / methods*
Humans
Models, Genetic*
Oligonucleotide Array Sequence Analysis / methods*
Pattern Recognition, Automated
Proteins / chemistry
Proteins / genetics

Substances

Biomarkers
Proteins