Comparison of methods for tuning machine learning model hyper-parameters: with application to predicting high-need high-cost health care users

BMC Med Res Methodol. 2025 May 15;25(1):134. doi: 10.1186/s12874-025-02561-x.

Abstract

Background: Supervised machine learning is increasingly being used to estimate clinical predictive models. Several supervised machine learning models involve hyper-parameters, whose values must be judiciously specified to ensure adequate predictive performance.

Objective: To compare nine hyper-parameter optimization (HPO) methods for tuning the hyper-parameters of an extreme gradient boosting model, with application to predicting high-need high-cost health care users.

Methods: Extreme gradient boosting models were estimated using a randomly sampled training dataset. Models were separately trained using nine different HPO methods: 1) random sampling, 2) simulated annealing, 3) quasi-Monte Carlo sampling, 4-5) two variations of Bayesian hyper-parameter optimization via tree-structured Parzen estimators, 6-7) two implementations of Bayesian hyper-parameter optimization via Gaussian processes, 8) Bayesian hyper-parameter optimization via random forests, and 9) the covariance matrix adaptation evolution strategy (CMA-ES). For each HPO method, we estimated 100 extreme gradient boosting models at different hyper-parameter configurations and evaluated model performance using an AUC metric on a randomly sampled validation dataset. Using the best model identified by each HPO method, we evaluated generalization performance in terms of discrimination and calibration metrics on a randomly sampled held-out test dataset (internal validation) and a temporally independent dataset (external validation).
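
The abstract does not give the authors' implementation details, so the following is only a minimal sketch of the tuning loop described above, assuming an Optuna tree-structured Parzen estimator (TPE) sampler, the scikit-learn API of XGBoost, and illustrative search ranges; the synthetic data, hyper-parameter bounds, and variable names are assumptions, not the study's actual setup.

    import optuna
    import xgboost as xgb
    from sklearn.datasets import make_classification
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in data so the sketch runs end to end; the study used
    # administrative health data that is not reproduced here.
    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.25, random_state=0)

    def objective(trial):
        # One candidate hyper-parameter configuration proposed by the sampler;
        # the ranges below are illustrative, not the authors' exact search space.
        params = {
            "max_depth": trial.suggest_int("max_depth", 2, 10),
            "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
            "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
            "subsample": trial.suggest_float("subsample", 0.5, 1.0),
            "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
            "min_child_weight": trial.suggest_int("min_child_weight", 1, 20),
        }
        model = xgb.XGBClassifier(**params, eval_metric="logloss")
        model.fit(X_train, y_train)
        # Validation-set AUC is the objective being maximized.
        return roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])

    # 100 configurations per HPO method, as in the study; swapping the sampler
    # (e.g. RandomSampler, QMCSampler, CmaEsSampler) changes the HPO strategy
    # while the rest of the loop stays the same.
    study = optuna.create_study(direction="maximize",
                                sampler=optuna.samplers.TPESampler(seed=0))
    study.optimize(objective, n_trials=100)
    print(study.best_params, study.best_value)

Because only the sampler object changes between HPO strategies in this kind of loop, the nine methods compared in the study can be evaluated under an otherwise identical training and validation protocol.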

Results: The extreme gradient boosting model estimated using default hyper-parameter settings had reasonable discrimination (AUC=0.82) but was not well calibrated. Hyper-parameter tuning using any of the HPO algorithms/samplers improved model discrimination (AUC=0.84), resulted in models with near-perfect calibration, and consistently identified the same features as predictive of high-need high-cost health care users.
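
The abstract does not name the specific discrimination and calibration metrics used, so the following sketch simply illustrates how such quantities might be computed on a held-out test set; the choice of AUC, Brier score, and a binned calibration curve (and the function and variable names) are assumptions for illustration only.

    import numpy as np
    from sklearn.calibration import calibration_curve
    from sklearn.metrics import brier_score_loss, roc_auc_score

    def evaluate(model, X_test, y_test, n_bins=10):
        """Return discrimination and calibration summaries on held-out data."""
        p = model.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, p)          # discrimination
        brier = brier_score_loss(y_test, p)     # overall probabilistic accuracy
        obs, pred = calibration_curve(y_test, p, n_bins=n_bins)
        # For a well-calibrated model, the observed event rate in each bin tracks
        # the mean predicted probability, i.e. (pred, obs) points hug the diagonal.
        max_gap = float(np.max(np.abs(obs - pred)))
        return {"auc": auc, "brier": brier, "max_calibration_gap": max_gap}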

Conclusions: In our study, all HPO algorithms resulted in similar gains in model performance relative to baseline models. This finding likely relates to our study dataset having a large sample size, a relatively small number of features, and a strong signal-to-noise ratio, and would likely apply to other datasets with similar characteristics.

Keywords: Clinical predictive modelling; Extreme gradient boosting classifier; Hyper-parameter optimization (HPO); Hyper-parameter tuning (HPT); Prediction model; Supervised machine learning.

Publication types

  • Comparative Study

MeSH terms

  • Algorithms
  • Bayes Theorem
  • Health Care Costs* / statistics & numerical data
  • Health Services Needs and Demand* / economics
  • Health Services Needs and Demand* / statistics & numerical data
  • Humans
  • Machine Learning*
  • Monte Carlo Method