Modeling X Chromosome Data Using Random Forests: Conquering Sex Bias

Genet Epidemiol. 2016 Feb;40(2):123-32. doi: 10.1002/gepi.21946. Epub 2015 Dec 7.

Abstract

Machine learning methods, including Random Forests (RF), are increasingly used for genetic data analysis. However, the standard RF algorithm does not correctly model the effects of X chromosome single nucleotide polymorphisms (SNPs), leading to biased estimates of variable importance. We propose extensions of RF to correctly model X SNPs, including a stratified approach and an approach based on the process of X chromosome inactivation. We applied the new and standard RF approaches to case-control alcohol dependence data from the Study of Addiction: Genes and Environment (SAGE), and compared the performance of the alternative approaches via a simulation study. Standard RF applied to a case-control study of alcohol dependence yielded inflated variable importance estimates for X SNPs, even when sex was included as a variable, but the results of the new RF methods were consistent with univariate regression-based approaches that correctly model X chromosome data. Simulations showed that the new RF methods eliminate the bias in standard RF variable importance for X SNPs when sex is associated with the trait, and are able to detect causal autosomal and X SNPs. Even in the absence of sex effects, the new extensions perform similarly to standard RF. Thus, we provide a powerful multimarker approach for genetic analysis that accommodates X chromosome data in an unbiased way. This method is implemented in the freely available R package "snpRF" (http://www.cran.r-project.org/web/packages/snpRF/).

Keywords: Random Forest; X chromosome; bias; sex differences; variable importance.
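
Illustrative sketch (not part of the published record). The abstract mentions an RF extension "based on the process of X chromosome inactivation" but gives no algorithmic detail. One way to picture such an approach, offered here purely as an assumption and not as the snpRF implementation, is to draw a single "active" allele per X SNP for each individual before every tree is grown. Females, who carry two copies, then contribute one randomly chosen allele, males contribute their only allele, and the X SNP takes values in the same range for both sexes. The function name, data layout, and toy data below are hypothetical.

```python
import random


def sample_active_alleles(x_alleles, rng):
    """Pick one allele per individual at a single X SNP.

    x_alleles: list of tuples of 0/1 allele codes (hypothetical layout);
               females are (a1, a2), males are (a1,).
    rng:       a random.Random instance, so draws are reproducible.
    Returns a list with one 0/1 value per individual.
    """
    # rng.choice on a 1-tuple always returns the male's single allele;
    # on a 2-tuple it picks one of the female's two alleles at random.
    return [rng.choice(alleles) for alleles in x_alleles]


# Hypothetical toy data: female het, male, female hom, male, female hom.
x_alleles = [(0, 1), (1,), (1, 1), (0,), (0, 0)]

rng = random.Random(0)
# Assumption: a fresh draw is made for each tree in the forest.
for tree_index in range(3):
    print(tree_index, sample_active_alleles(x_alleles, rng))
```

In this sketch only heterozygous females vary from tree to tree; males and homozygous females always yield the same value, so the draw leaves their X genotype information unchanged.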

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Alcoholism / genetics*
  • Algorithms
  • Bias*
  • Case-Control Studies
  • Chromosomes, Human, X / genetics*
  • Computer Simulation
  • Data Interpretation, Statistical
  • Decision Trees*
  • Genetic Markers / genetics
  • Genetic Predisposition to Disease*
  • Humans
  • Models, Genetic
  • Phenotype
  • Polymorphism, Single Nucleotide / genetics*
  • Sex Factors

Substances

  • Genetic Markers