The curse of dimensionality: Animal-related risk factors for pediatric diarrhea in western Kenya, and methods for dealing with a large number of predictors

PLoS One. 2019 Apr 26;14(4):e0215982. doi: 10.1371/journal.pone.0215982. eCollection 2019.


Background: Pediatric diarrhea, a leading cause of under-five mortality, is predominantly infectious in etiology. As many putative causal agents are zoonotic, animal exposure is a likely risk factor. To evaluate the effect of animal-related factors on moderate to severe childhood diarrhea in rural Kenya, where animal contact is common, Conan et al. studied 73 matched case-control pairs from 2009 to 2011, collecting rich exposure data on many dimensions of animal contact. We review the challenges associated with analyzing moderately sized datasets with a large number of predictors and present two alternative methodological approaches.

Methodology/principal findings: We conducted a simulation study to demonstrate that forward stepwise selection results in overfit models when data are high-dimensional, and that p values reported directly from the data used to train these models are misleading. We described how automated methods of variable selection, attractive when the number of predictors is large, can result in overadjustment bias. We proposed an alternative a priori regression approach not subject to this bias. Applied to Conan et al.'s data, this approach found a non-significant but positive trend for household's sharing of water sources with livestock or poultry, child's presence for poultry slaughter, and child's habit of playing where poultry sleep or defecate. For many of the predictors evaluated, few pairs were discordant, suggesting that matching compromised the power of this analysis. Finally, we proposed latent variable modeling as a complementary approach and performed Item Response Theory modeling on Conan et al.'s data, with animal contact as the latent trait. We found a moderate but non-significant effect (OR 1.21, 95% CI 0.78, 1.87; unit = 1 standard deviation).
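
The point about misleading p values can be illustrated with a small simulation. The following is a minimal sketch and not the authors' simulation study: it mimics only the first step of forward selection, uses a continuous outcome with Pearson correlation rather than the conditional logistic models appropriate for matched data, and approximates the p value with the Fisher z transform. The sample size of 146 (two children per pair), the 40 predictors, and all function names are illustrative assumptions. Every predictor is pure noise, so any "significant" finding is a false positive.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def naive_p_value(r, n):
    """Two-sided p value for r via the Fisher z approximation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def best_null_predictor_p(n_obs, n_predictors, rng):
    """Pick the noise predictor most correlated with a noise outcome
    (the first step of forward selection) and return the p value
    computed on the same data that was used for the selection."""
    y = [rng.gauss(0, 1) for _ in range(n_obs)]
    best_r = 0.0
    for _ in range(n_predictors):
        x = [rng.gauss(0, 1) for _ in range(n_obs)]
        r = pearson_r(x, y)
        if abs(r) > abs(best_r):
            best_r = r
    return naive_p_value(best_r, n_obs)


def false_positive_rate(n_sims=200, n_obs=146, n_predictors=40,
                        alpha=0.05, seed=0):
    """Fraction of simulated datasets whose selected predictor
    has a naive p value below alpha, although no effect exists."""
    rng = random.Random(seed)
    hits = sum(
        best_null_predictor_p(n_obs, n_predictors, rng) < alpha
        for _ in range(n_sims)
    )
    return hits / n_sims


if __name__ == "__main__":
    print("1 candidate predictor:  ", false_positive_rate(n_predictors=1))
    print("40 candidate predictors:", false_positive_rate(n_predictors=40))
```

With a single candidate the false positive rate should stay near the nominal 5%. With 40 independent candidates, the chance that at least one clears the threshold is about 1 - 0.95^40, roughly 0.87, which is why a p value reported from the selection data cannot be read at face value.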

Conclusions/significance: Automated methods of model selection are appropriate for prediction models when fit and evaluated on separate samples. However, when the goal is inference, these methods can produce misleading results. Furthermore, case-control matching should be done with caution.
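
The caution about matching follows from how matched-pair data are analyzed: for a single binary exposure, the standard conditional estimate of the odds ratio uses only the discordant pairs (pairs in which exactly one member is exposed), so concordant pairs add no information. The sketch below applies that textbook estimator with a Wald interval on the log scale. The counts are hypothetical and chosen only for illustration; they are not taken from Conan et al.'s data, and the function name is an assumption.

```python
import math


def matched_pair_odds_ratio(case_only, control_only, z=1.96):
    """Conditional odds ratio for one binary exposure in 1:1 matched data.

    case_only    -- pairs where only the case is exposed
    control_only -- pairs where only the control is exposed
    Returns (odds_ratio, lower, upper) with a Wald interval on the log scale.
    """
    if case_only <= 0 or control_only <= 0:
        raise ValueError("both discordant counts must be positive")
    odds_ratio = case_only / control_only
    se_log_or = math.sqrt(1 / case_only + 1 / control_only)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Hypothetical counts: same odds ratio, different numbers of
    # discordant pairs out of a fixed total number of matched pairs.
    for case_only, control_only in [(6, 4), (30, 20)]:
        est, lo, hi = matched_pair_odds_ratio(case_only, control_only)
        print(f"{case_only + control_only} discordant pairs: "
              f"OR {est:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```

Both hypothetical scenarios give the same point estimate, but the interval is far wider when only 10 pairs are discordant, regardless of how many matched pairs were enrolled.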

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Animals
  • Case-Control Studies
  • Child
  • Computer Simulation
  • Confounding Factors, Epidemiologic
  • Diarrhea / epidemiology*
  • Humans
  • Kenya / epidemiology
  • Latent Class Analysis
  • Models, Biological
  • Regression Analysis
  • Risk Factors