Background: Hypereosinophilic syndrome (HES) is challenging to diagnose and identify in real-world data.
Objective: We sought to develop a machine learning model to predict HES diagnosis in secondary data and estimate HES prevalence among individuals with elevated blood eosinophil count (BEC).
Methods: Open medical/pharmacy claims were used to develop a prediction model. The study population comprised patients with ≥1 HES diagnosis code and ≥1 elevated BEC (>1,000 cells/μL) and non-HES controls with elevated BEC. Candidate predictors for HES diagnosis included demographics, disease manifestations, treatments, health care utilization, and procedures. The model was fit as a generalized linear mixed model with a binomial distribution and logit link function, and its performance was evaluated using 5-fold cross-validation.
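As an illustration of the evaluation design described in the Methods, the following is a minimal sketch of 5-fold cross-validation for a binomial model with a logit link. It is not the authors' implementation: it fits a fixed-effects logistic regression by gradient descent (no random effects, so it is not a mixed model), uses synthetic data with two hypothetical binary predictors, and reports only the area under the receiver operating characteristic curve. All function names, learning settings, and data are assumptions made for this example.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function (inverse of the logit link).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=200):
    # Batch gradient descent on the binomial log-loss; fixed effects only.
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            gb += err
            for j in range(p):
                gw[j] += err * xi[j]
        b -= lr * gb / n
        w = [wj - lr * gj / n for wj, gj in zip(w, gw)]
    return w, b


def predict_proba(model, X):
    w, b = model
    return [sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]


def auroc(y_true, scores):
    # Rank-based (Mann-Whitney) AUROC; tied scores count as half a win.
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def cross_validated_auroc(X, y, k=5, seed=0):
    # Shuffle indices once, then hold out each of the k folds in turn.
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    aucs = []
    for held_out in folds:
        test = set(held_out)
        train = [i for i in idx if i not in test]
        model = fit_logistic([X[i] for i in train], [y[i] for i in train])
        probs = predict_proba(model, [X[i] for i in held_out])
        aucs.append(auroc([y[i] for i in held_out], probs))
    return aucs


# Synthetic toy data (hypothetical): two binary indicators that are
# more common among cases than among controls.
rng = random.Random(42)
X, y = [], []
for _ in range(300):
    label = 1 if rng.random() < 0.3 else 0
    f1 = 1.0 if rng.random() < (0.7 if label else 0.1) else 0.0
    f2 = 1.0 if rng.random() < (0.6 if label else 0.2) else 0.0
    X.append([f1, f2])
    y.append(label)

fold_aucs = cross_validated_auroc(X, y)
mean_auc = sum(fold_aucs) / len(fold_aucs)
print("Per-fold AUROC:", [round(a, 3) for a in fold_aucs])
print("Mean AUROC:", round(mean_auc, 3))
```

With a case-to-control ratio as imbalanced as the one reported in the Results, a stratified split that preserves the case fraction in each fold would likely be preferable to the simple shuffle used here.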
Results: A total of 260 patients with a HES diagnosis and 157,718 non-HES controls with elevated BEC were included. Predictors with the largest coefficients included bone marrow biopsy, eosinophilic gastrointestinal disorders, eosinophilic granulomatosis with polyangiitis, and autoimmune disease of the digestive system. The final model achieved an area under the receiver operating characteristic curve of 0.82 and an area under the precision-recall curve of 0.83. Applying the model at a predicted probability threshold of 0.7 identified 6,233 patients with predicted HES, whose characteristics were similar to those of patients with a HES diagnosis code. At this threshold, the prevalences of predicted HES and total HES (diagnosed plus predicted) among those with an elevated BEC were 5.30% and 5.65%, respectively.
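The thresholding step in the Results can be sketched as follows. This is an illustration only: the abstract does not state the exact denominator used for the prevalence estimates or whether the 0.7 threshold is inclusive, so the sketch assumes the denominator is all patients with elevated BEC (diagnosed plus undiagnosed) and that probabilities at or above the threshold count as predicted HES. The function name and the toy probabilities are hypothetical and are not study data.

```python
def prevalence_at_threshold(probs_undiagnosed, n_diagnosed, threshold=0.7):
    # Count undiagnosed patients whose predicted probability meets the
    # threshold (inclusive comparison is an assumption of this sketch).
    n_predicted = sum(1 for p in probs_undiagnosed if p >= threshold)
    # Assumed denominator: everyone with elevated BEC, diagnosed or not.
    n_population = len(probs_undiagnosed) + n_diagnosed
    return {
        "n_predicted": n_predicted,
        "predicted_prevalence": n_predicted / n_population,
        "total_prevalence": (n_predicted + n_diagnosed) / n_population,
    }


# Toy predicted probabilities for eight undiagnosed patients (hypothetical).
toy_probs = [0.95, 0.72, 0.70, 0.69, 0.40, 0.20, 0.10, 0.05]
result = prevalence_at_threshold(toy_probs, n_diagnosed=2, threshold=0.7)
print(result)
```

Raising the threshold trades sensitivity for specificity, so the predicted and total prevalence estimates both depend directly on this choice.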
Conclusion: A machine learning prediction model identified a substantial number of patients with predicted HES, suggesting that the actual prevalence of HES may be considerably underestimated.
Keywords: Hypereosinophilic syndrome; machine learning; prediction model; prevalence.
© 2025 The Authors.