A Comparison of Logistic Regression, Logic Regression, Classification Tree, and Random Forests to Identify Effective Gene-Gene and Gene-Environmental Interactions

Int J Appl Sci Technol. 2012 Aug;2(7):268.

Abstract

Genome wide association studies (GWAS) have identified numerous single nucleotide polymorphisms (SNPs) that are associated with a variety of common human diseases. Due to the weak marginal effect of most disease-associated SNPs, attention has recently turned to evaluating the combined effect of multiple disease-associated SNPs on the risk of disease. Several recent multigenic studies show potential evidence of applying multigenic approaches in association studies of various diseases including lung cancer. But the question remains as to the best methodology to analyze single nucleotide polymorphisms in multiple genes. In this work, we consider four methods-logistic regression, logic regression, classification tree, and random forests-to compare results for identifying important genes or gene-gene and gene-environmental interactions. To evaluate the performance of four methods, the cross-validation misclassification error and areas under the curves are provided. We performed a simulation study and applied them to the data from a large-scale, population-based, case-control study.

Keywords: Area under the Curve; Classification tree; Cross-validation error; Logic regression; Logistic regression; Random Forests; SNP interactions.