Predicting HCV Incidence in Latinos with High-Risk Substance Use: A Data Science Approach

Soc Work Public Health. 2019;34(7):606-615. doi: 10.1080/19371918.2019.1635948. Epub 2019 Aug 2.


Hepatitis C virus (HCV) infection in the U.S. has tripled in the prior five years, and injecting drug use is the primary risk factor for HCV, with up to 90% of older and former people who inject drugs (PWIDs) testing positive. Laboratory testing for HCV is the gold standard for any PWID; however, many PWIDs lack access to health treatment or services. Identifying HCV risk via a data science approach would help community health workers (CHWs) rapidly link those most at risk of infection with treatment. This study employed a data science approach to determine the strongest risk factors for HCV in a sample of Mexican-American PWIDs, n = 221 (96 negative/125 positive). Data included 238 demographic and psychosocial predictors. A Random Forest machine learning algorithm demonstrated significant prediction improvement over a baseline no-information-rate comparison. The strongest risk factors for positive HCV status included sharing drug-use equipment and younger age at first heroin use; receiving drug education during incarceration was protective. A ROC curve fit to the predictions yielded an area under the curve of 0.77. The predictive variables for HCV in the present analysis can be obtained via screening by CHWs. Identifying patients most at risk of HCV within community settings can maximize treatment utilization.
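The no-information rate used as the baseline above is the accuracy obtained by always predicting the most common class. A minimal sketch follows, using only the class counts reported in the abstract (96 negative, 125 positive). The function names are hypothetical, and the one-sided exact binomial test is one common way to compare a classifier's accuracy against that baseline; the abstract does not state which test or evaluation split the authors used.

```python
from math import comb


def no_information_rate(class_counts):
    """Accuracy of a classifier that always predicts the most common class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total


def p_value_above_nir(n_correct, n_total, nir):
    """One-sided exact binomial P(X >= n_correct) if true accuracy equals nir.

    n_correct and n_total describe an evaluation set; they are inputs the
    reader must supply, since the abstract reports no accuracy figures.
    """
    return sum(
        comb(n_total, k) * nir**k * (1 - nir) ** (n_total - k)
        for k in range(n_correct, n_total + 1)
    )


# Class counts as reported in the abstract.
counts = {"negative": 96, "positive": 125}
nir = no_information_rate(counts)  # 125 / 221, roughly 0.566
```

A small p-value from `p_value_above_nir` would indicate that the observed accuracy is unlikely under a model that merely guesses the majority class, which is the sense in which a classifier "improves over" the no-information rate.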

Keywords: Hepatitis C; Latino; data science; injecting drug use; machine learning; random forest.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Female
  • Hepacivirus*
  • Hepatitis C / epidemiology*
  • Hepatitis C / etiology
  • Hispanic or Latino
  • Humans
  • Incidence
  • Male
  • Middle Aged
  • Risk-Taking*
  • Substance Abuse, Intravenous*
  • Surveys and Questionnaires
  • United States / epidemiology