Background: Rising healthcare costs and demand call for better identification of individuals at risk of high-cost healthcare use. Few prediction models use detailed survey data or address persistent high-cost use in the general population. This study aimed to develop and externally validate prediction models for all-cause single-year and persistent high-cost healthcare use, and to assess whether adding survey data to administrative registry data improved performance.
Methods: This was a prognostic study based on two population-based cohorts, the Trøndelag Health Study (HUNT4; model development) and the Tromsø Study (Tromsø7; external validation), linked to prospectively collected health registry data from primary and secondary care. Outcomes were (1) single-year high-cost use, defined as being in the top 25% of total healthcare costs in year one after survey completion, and (2) persistent high-cost use, defined as being in the top 25% in both years one and two. Predictors included self-reported sociodemographic and health-related variables and health registry data (prior-year costs and a morbidity index). Logistic regression models were developed for each outcome and internally validated via five-fold cross-validation. Model performance was assessed through discrimination and calibration. XGBoost models were trained and tested for benchmarking. External validation applied the developed models without refitting. We also developed and validated registry-only and survey-only models to compare performance against the full model.
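As an illustration of the two outcome definitions above, the following minimal sketch labels single-year and persistent high-cost use from per-person annual costs. It is not the study's code: the function names, the toy cost values, and the inclusive handling of ties at the 75th-percentile cut-off are all assumptions made for this example.

```python
def top_quartile_threshold(costs):
    """Return the lowest cost that still falls in the top 25% of the list.

    Assumption: ties at the cut-off are counted as high-cost (inclusive).
    """
    ranked = sorted(costs, reverse=True)
    k = max(1, len(ranked) // 4)  # number of people in the top 25%
    return ranked[k - 1]


def label_outcomes(costs_year1, costs_year2):
    """Label each person for single-year and persistent high-cost use.

    single-year: top 25% of total costs in year one.
    persistent:  top 25% in both year one and year two.
    """
    t1 = top_quartile_threshold(costs_year1)
    t2 = top_quartile_threshold(costs_year2)
    single = [c1 >= t1 for c1 in costs_year1]
    persistent = [
        c1 >= t1 and c2 >= t2 for c1, c2 in zip(costs_year1, costs_year2)
    ]
    return single, persistent


# Hypothetical annual costs for eight people (arbitrary currency units).
costs_year1 = [100, 250, 90, 4000, 300, 120, 5200, 80]
costs_year2 = [110, 3900, 95, 4100, 280, 100, 200, 60]

single, persistent = label_outcomes(costs_year1, costs_year2)
```

In this toy data, two people are high-cost in year one, but only one of them remains in the top 25% in year two, so the persistent outcome is by construction rarer than the single-year outcome.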
Results: The development cohort included 42,049 individuals, and the external validation cohort included 20,942. In internal validation, the full logistic regression model achieved C-statistics of 0.79 (95% CI 0.78–0.79) for single-year high-cost use and 0.83 (95% CI 0.83–0.84) for persistent high-cost use. Corresponding C-statistics in external validation were 0.78 (95% CI 0.77–0.78) and 0.82 (95% CI 0.81–0.83). The models appeared well-calibrated on calibration plots. Full models showed significantly higher C-statistics than registry-only models (p < 0.001).
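For readers less familiar with the C-statistic reported above, the sketch below computes it as the proportion of (high-cost, non-high-cost) pairs in which the high-cost individual received the higher predicted risk, with ties counted as one half. The function name and the toy labels and risks are hypothetical; the study's own software and confidence-interval method are not specified here.

```python
def c_statistic(y_true, y_score):
    """Concordance (C) statistic for a binary outcome.

    Equals the probability that a randomly chosen person with the outcome
    has a higher predicted risk than a randomly chosen person without it.
    Tied predictions contribute 0.5.
    """
    pos = [s for y, s in zip(y_true, y_score) if y]
    neg = [s for y, s in zip(y_true, y_score) if not y]
    if not pos or not neg:
        raise ValueError("Both outcome classes must be present.")

    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))


# Hypothetical outcomes (1 = high-cost user) and predicted risks.
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]

c = c_statistic(y_true, y_score)  # 8 of 9 pairs are concordant
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation. This pairwise form is quadratic in sample size, so a rank-based implementation would be preferable for cohorts of the size reported here.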
Conclusion: Prediction models for identifying all-cause single-year and persistent high-cost healthcare use in the general adult population were developed and externally validated, showing good discrimination and calibration. The models can inform targeted preventive strategies and population health management. Incorporating self-reported survey data improved predictive performance, supporting the combination of data sources for risk stratification.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12913-026-14295-7.
Keywords: Health expenditures; Healthcare utilisation; High-cost users; Prediction.