Using Machine Learning to Uncover Hidden Heterogeneities in Survey Data

Christina M Ramirez; Marisa A Abrajano; R Michael Alvarez

doi:10.1038/s41598-019-51862-x

Sci Rep. 2019 Nov 5;9(1):16061. doi: 10.1038/s41598-019-51862-x.

Authors

Christina M Ramirez¹, Marisa A Abrajano², R Michael Alvarez³

Affiliations

¹ Department of Biostatistics, UCLA Fielding School of Public Health, UCLA, Los Angeles, CA, 90095-1772, USA. cr@g.ucla.edu.
² Department of Political Science, University of California, San Diego, La Jolla, CA, 92093-0521, USA.
³ Division of Humanities and Social Sciences, California Institute of Technology, Pasadena, CA, 91125, USA.

Abstract

Responses to public health surveys are heterogeneous. The quality of a respondent's answers depends on many factors, including cognitive ability, interview context, and whether the interview is conducted in person or self-administered. A largely unexplored issue is how the language in which a public health survey interview is conducted is associated with the responses. We introduce a machine learning approach, Fuzzy Forests, which we use for model selection. We use the 2013 California Health Interview Survey (CHIS) as the training sample and the 2014 CHIS as the test sample. We found that non-English language survey responses differ substantially from English responses in reported health outcomes. We also found heterogeneity among the Asian languages, suggesting that caution is warranted when interpreting results that compare across these languages. The Fuzzy Forests model trained on the 2013 data correctly predicted 86% of good health outcomes in the 2014 test set. We show that the Fuzzy Forests methodology is potentially useful for screening for and understanding other types of survey response heterogeneity, especially in high-dimensional and complex surveys.
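The evaluation design described in the abstract (fit on one survey year, score on the next) can be illustrated with a minimal Python sketch. This is not the Fuzzy Forests method and it does not use CHIS data: the model is a deliberately trivial stand-in that predicts the most common outcome per interview language, the records are made up, and all function names are hypothetical. The sketch also assumes that "correctly predicted 86% of good health outcomes" means the share of truly good outcomes that were predicted as good; the abstract does not spell out the exact definition.

```python
from collections import Counter, defaultdict


def fit_majority_by_group(records):
    """Stand-in model (NOT Fuzzy Forests): most common outcome per language."""
    counts = defaultdict(Counter)
    for language, outcome in records:
        counts[language][outcome] += 1
    # Fallback label for languages that never appear in the training year.
    overall = Counter(outcome for _, outcome in records).most_common(1)[0][0]
    by_group = {lang: c.most_common(1)[0][0] for lang, c in counts.items()}
    return by_group, overall


def predict(model, languages):
    """Predict one outcome per language, using the fallback when unseen."""
    by_group, overall = model
    return [by_group.get(lang, overall) for lang in languages]


def share_correct_for_class(y_true, y_pred, target):
    """Share of cases truly labeled `target` that were also predicted `target`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == target and p == target)
    total = sum(1 for t in y_true if t == target)
    return hits / total if total else float("nan")


# Made-up toy records of (interview language, self-reported health).
# These are illustrative only and are not drawn from the CHIS.
train_2013 = [
    ("english", "good"), ("english", "good"), ("english", "good"),
    ("english", "poor"),
    ("spanish", "poor"), ("spanish", "poor"), ("spanish", "good"),
]
test_2014 = [
    ("english", "good"), ("english", "good"),
    ("spanish", "good"), ("spanish", "poor"),
    ("korean", "good"),
]

# Fit on the earlier year only, then score on the later year.
model = fit_majority_by_group(train_2013)
y_true = [outcome for _, outcome in test_2014]
y_pred = predict(model, [language for language, _ in test_2014])
rate = share_correct_for_class(y_true, y_pred, "good")
print(f"Share of good-health outcomes predicted correctly: {rate:.2f}")
```

Holding out a later survey year, rather than a random subset of the same year, tests whether the fitted model carries over to a new wave of respondents, which is the design the abstract reports.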

Publication types

Research Support, U.S. Gov't, Non-P.H.S.