Background: Atrial fibrillation (AF) is a major risk factor for atherothrombotic complications but is often asymptomatic and undiagnosed. This study aimed to develop a machine learning model to distinguish between individuals with low and high risk of AF, using routinely collected diagnostic data from Swedish primary health care.
Methods: The study included cases (n = 42,607, aged ≥ 45 years) with diagnosed new-onset AF and controls (n = 427,169) matched by age and sex. Machine learning models stratified by age (45–69 and ≥ 70 years) and sex were developed using stochastic gradient boosting, based on the number of primary health care visits during the year before the index AF diagnosis, age, and ICD-10 codes recorded in electronic medical records from 2014 to 2019. Performance was evaluated by AUC, sensitivity, and specificity, and key predictors were ranked by normalized relative influence (NRI) and odds ratios for marginal effects.
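The three performance measures named in the Methods can be illustrated with a minimal sketch. It is not the study's code: the risk scores, the 0.5 threshold, and the function names `auc` and `sensitivity_specificity` are hypothetical and chosen only for illustration. AUC is computed here as the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen control, which is one standard way to define it.

```python
def auc(case_scores, control_scores):
    """Share of case/control pairs in which the case scores higher (ties count 0.5)."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def sensitivity_specificity(case_scores, control_scores, threshold):
    """Classify score >= threshold as high risk and return (sensitivity, specificity)."""
    true_positives = sum(s >= threshold for s in case_scores)
    true_negatives = sum(s < threshold for s in control_scores)
    return true_positives / len(case_scores), true_negatives / len(control_scores)


# Hypothetical predicted risks, NOT data from the study.
case_scores = [0.9, 0.7, 0.6, 0.3]
control_scores = [0.8, 0.4, 0.35, 0.2, 0.1, 0.05]

model_auc = auc(case_scores, control_scores)
sens, spec = sensitivity_specificity(case_scores, control_scores, threshold=0.5)
print(f"AUC={model_auc:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

AUC does not depend on a classification threshold, whereas sensitivity and specificity do. Lowering the threshold raises sensitivity at the cost of specificity.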
Results: The most influential predictors were the number of visits (NRI: 29.9–46.3%) and age (NRI: 6.2–15.9%), followed by risk factors for AF such as heart failure, hypertension, and cardiac arrhythmias. Model AUC ranged from 0.77 to 0.79 across subgroups. Sensitivity was 0.76–0.80, and specificity 0.58–0.66, with higher sensitivity in older groups and higher specificity in younger ones. The models correctly identified 95–98% of individuals without known AF.
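The abstract does not say how the 95–98% figure is defined. One possible reading, offered here as an assumption rather than the authors' stated definition, is a negative predictive value (NPV): the share of individuals classified as low risk who have no known AF. The sketch below checks whether that reading is arithmetically plausible. It uses the pooled case and control counts from the Methods, although the models were fitted per age and sex subgroup, and it tries every pairing of the reported sensitivity and specificity bounds, because the abstract does not give per-subgroup values.

```python
def negative_predictive_value(n_cases, n_controls, sensitivity, specificity):
    """NPV = true negatives / (true negatives + false negatives)."""
    false_negatives = n_cases * (1.0 - sensitivity)
    true_negatives = n_controls * specificity
    return true_negatives / (true_negatives + false_negatives)


# Pooled counts from the Methods; subgroup sizes are not given in the abstract.
N_CASES = 42_607
N_CONTROLS = 427_169

# All pairings of the reported bounds (an assumption, see lead-in).
npvs = [
    negative_predictive_value(N_CASES, N_CONTROLS, sens, spec)
    for sens in (0.76, 0.80)
    for spec in (0.58, 0.66)
]
print(f"NPV range: {min(npvs):.3f} to {max(npvs):.3f}")
```

Under these assumptions the NPV falls at roughly 0.96 to 0.97, which is consistent with the reported 95–98%. This value reflects the roughly 1:10 case-to-control ratio of the study sample and would differ in a population with a different AF prevalence.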
Conclusions: The models show good predictive ability, effectively ruling out low-risk patients while identifying known risk factors. With AUC values comparable to those of more complex models, our approach, which uses only visit frequency, age, and diagnoses, may support initial risk assessment in primary health care to identify individuals at risk of AF.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12911-026-03491-4.
Keywords: Artificial intelligence; Atrial fibrillation; Gradient boosting; Normalized relative influence; Prediction.