Are the Relevant Risk Factors Being Adequately Captured in Empirical Studies of Smoking Initiation? A Machine Learning Analysis Based on the Population Assessment of Tobacco and Health Study

Nicotine Tob Res. 2023 Jul 14;25(8):1481-1488. doi: 10.1093/ntr/ntad066.


Introduction: Cigarette smoking continues to pose a threat to public health. Identifying individual risk factors for smoking initiation is essential to further mitigate this epidemic. To the best of our knowledge, no study today has used machine learning (ML) techniques to automatically uncover informative predictors of smoking onset among adults using the Population Assessment of Tobacco and Health (PATH) study.

Aims and methods: In this work, we employed random forest paired with Recursive Feature Elimination to identify relevant PATH variables that predict smoking initiation among adults who have never smoked at baseline between two consecutive PATH waves. We included all potentially informative baseline variables in wave 1 (wave 4) to predict past 30-day smoking status in wave 2 (wave 5). Using the first and most recent pairs of PATH waves was found sufficient to identify the key risk factors of smoking initiation and test their robustness over time. The eXtreme Gradient Boosting method was employed to test the quality of these selected variables.

Results: As a result, classification models suggested about 60 informative PATH variables among many candidate variables in each baseline wave. With these selected predictors, the resulting models have a high discriminatory power with the area under the specificity-sensitivity curves of around 80%. We examined the chosen variables and discovered important features. Across the considered waves, two factors, (1) BMI, and (2) dental and oral health status, robustly appeared as important predictors of smoking initiation, besides other well-established predictors.

Conclusions: Our work demonstrates that ML methods are useful to predict smoking initiation with high accuracy, identifying novel smoking initiation predictors, and to enhance our understanding of tobacco use behaviors.

Implications: Understanding individual risk factors for smoking initiation is essential to prevent smoking initiation. With this methodology, a set of the most informative predictors of smoking onset in the PATH data were identified. Besides reconfirming well-known risk factors, the findings suggested additional predictors of smoking initiation that have been overlooked in previous work. More studies that focus on the newly discovered factors (BMI and dental and oral health status,) are needed to confirm their predictive power against the onset of smoking as well as determine the underlying mechanisms.

MeSH terms

  • Adult
  • Cigarette Smoking* / epidemiology
  • Electronic Nicotine Delivery Systems*
  • Humans
  • Longitudinal Studies
  • Risk Factors
  • Tobacco Products*
  • Tobacco Use / epidemiology