Predicting the future risk of lung cancer: development, and internal and external validation of the CanPredict (lung) model in 19·67 million people and evaluation of model performance against seven other risk prediction models

Lancet Respir Med. 2023 Aug;11(8):685-697. doi: 10.1016/S2213-2600(23)00050-4. Epub 2023 Apr 5.


Background: Lung cancer is the second most common cancer in incidence and the leading cause of cancer deaths worldwide. Lung cancer screening with low-dose CT can reduce mortality. The UK National Screening Committee recommended targeted lung cancer screening on Sept 29, 2022, and asked for more modelling work to be done to help refine the recommendation. This study aims to develop and validate a risk prediction model, the CanPredict (lung) model, for lung cancer screening in the UK and to compare its performance against seven other risk prediction models.

Methods: For this retrospective, population-based, cohort study, we used linked electronic health records from two English primary care databases: QResearch (Jan 1, 2005-March 31, 2020) and Clinical Practice Research Datalink (CPRD) Gold (Jan 1, 2004-Jan 1, 2015). The primary study outcome was an incident diagnosis of lung cancer. We used a Cox proportional-hazards model in the derivation cohort (12·99 million individuals aged 25-84 years from the QResearch database) to develop the CanPredict (lung) model in men and women. We used discrimination measures (Harrell's C statistic, D statistic, and the explained variation in time to diagnosis of lung cancer [R²D]) and calibration plots to evaluate model performance by sex and ethnicity, using data from QResearch (4·14 million people for internal validation) and CPRD (2·54 million for external validation). Seven models for predicting lung cancer risk (Liverpool Lung Project [LLP]v2, LLPv3, Lung Cancer Risk Assessment Tool [LCRAT], Prostate, Lung, Colorectal, and Ovarian [PLCO]M2012, PLCOM2014, Pittsburgh, and Bach) were selected to compare their model performance with the CanPredict (lung) model using two approaches: (1) in ever-smokers aged 55-74 years (the population recommended for lung cancer screening in the UK), and (2) in the populations for each model determined by that model's eligibility criteria.
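
As an illustration of the discrimination measure named above, the following is a minimal sketch of Harrell's C statistic: the proportion of usable pairs of individuals in which the one diagnosed earlier also had the higher predicted risk. The function name, argument names, and toy data are hypothetical and are not taken from the study; pairs with tied follow-up times are simply skipped here, which is a simplification, and the quadratic loop would not scale to cohorts of millions.

```python
from itertools import combinations


def harrell_c(times, events, risk_scores):
    """Harrell's C for right-censored data (simplified, O(n^2) sketch).

    times       -- follow-up time for each individual
    events      -- 1 if lung cancer was diagnosed at that time, 0 if censored
    risk_scores -- predicted risk (higher means earlier diagnosis expected)
    """
    concordant = 0.0
    usable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter follow-up time.
        a, b = (i, j) if times[i] <= times[j] else (j, i)
        # Usable only if the times differ and the earlier one is an event;
        # otherwise we cannot tell who was diagnosed first.
        if times[a] == times[b] or not events[a]:
            continue
        usable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable


# Toy data (hypothetical): four people, the third is censored at time 6.
toy_c = harrell_c(
    times=[2, 4, 6, 8],
    events=[1, 1, 0, 1],
    risk_scores=[0.9, 0.3, 0.2, 0.4],
)
```

A value of 0·5 corresponds to chance-level ranking and 1·0 to perfect ranking.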

Findings: There were 73 380 incident lung cancer cases in the QResearch derivation cohort, 22 838 cases in the QResearch internal validation cohort, and 16 145 cases in the CPRD external validation cohort during follow-up. The predictors in the final model included sociodemographic characteristics (age, sex, ethnicity, Townsend score), lifestyle factors (BMI, smoking and alcohol status), comorbidities, family history of lung cancer, and personal history of other cancers. Some predictors were different between the models for women and men, but model performance was similar between sexes. The CanPredict (lung) model showed excellent discrimination and calibration in both internal and external validation of the full model, by sex and ethnicity. The model explained 65% of the variation in time to diagnosis of lung cancer (R²D) in both sexes in the QResearch validation cohort and 59% in both sexes in the CPRD validation cohort. Harrell's C statistics were 0·90 in the QResearch (validation) cohort and 0·87 in the CPRD cohort, and the D statistics were 2·8 in the QResearch (validation) cohort and 2·4 in the CPRD cohort. Compared with seven other lung cancer prediction models, the CanPredict (lung) model had the best performance in discrimination, calibration, and net benefit across three prediction horizons (5, 6, and 10 years) in the two approaches. The CanPredict (lung) model also had higher sensitivity than the current UK recommended models (LLPv2 and PLCOM2012), as it identified more lung cancer cases than those models when screening the same number of individuals at high risk.
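
The D statistics and explained-variation figures reported above are linked. The sketch below assumes the standard Royston and Sauerbrei definition for a Cox model, in which the explained variation is (D²/κ²) / (σ² + D²/κ²) with κ² = 8/π and σ² = π²/6; the abstract does not state the exact formula used, so this is an assumption, and the function name is hypothetical.

```python
import math


def explained_variation_from_d(d_statistic):
    """Explained variation implied by Royston's D for a Cox model.

    Assumes the Royston-Sauerbrei relation:
        R2 = (D^2 / kappa^2) / (sigma^2 + D^2 / kappa^2)
    with kappa^2 = 8 / pi and sigma^2 = pi^2 / 6.
    """
    kappa_sq = 8.0 / math.pi
    sigma_sq = math.pi ** 2 / 6.0
    scaled = d_statistic ** 2 / kappa_sq
    return scaled / (sigma_sq + scaled)


# D statistics as reported (rounded to one decimal place) in the abstract.
r2_qresearch = explained_variation_from_d(2.8)
r2_cprd = explained_variation_from_d(2.4)
```

Under this assumption, D = 2·8 gives about 0·65, which matches the reported 65% for QResearch, and D = 2·4 gives about 0·58, close to the reported 59% for CPRD; the small gap is consistent with D being rounded to one decimal place.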

Interpretation: The CanPredict (lung) model was developed, and internally and externally validated, using data from 19·67 million people from two English primary care databases. Our model has potential utility for risk stratification of the UK primary care population and selection of individuals at high risk of lung cancer for targeted screening. If our model is recommended to be implemented in primary care, each individual's risk can be calculated using information in the primary care electronic health records, and people at high risk can be identified for the lung cancer screening programme.
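
To make the idea of calculating each individual's risk concrete, here is a minimal sketch of how an absolute risk is usually obtained from a fitted Cox proportional-hazards model: risk(t) = 1 - S0(t)^exp(lp), where S0(t) is the baseline survival at the prediction horizon and lp is the linear predictor. The abstract does not publish the CanPredict (lung) coefficients or baseline survival, so every number and variable name below is hypothetical and for illustration only; centring covariates on their means is also an assumed convention.

```python
import math


def absolute_risk(baseline_survival, coefficients, covariates, means):
    """Absolute risk over a fixed horizon from a Cox model.

    baseline_survival -- S0(t) at the horizon, for a person at mean covariates
    coefficients      -- log hazard ratios keyed by predictor name
    covariates        -- the individual's predictor values
    means             -- covariate means used for centring (assumed convention)
    """
    linear_predictor = sum(
        coefficients[name] * (covariates[name] - means[name])
        for name in coefficients
    )
    return 1.0 - baseline_survival ** math.exp(linear_predictor)


# HYPOTHETICAL values, not the CanPredict (lung) model.
example_coefficients = {"age": 0.08, "current_smoker": 1.5}
example_means = {"age": 50.0, "current_smoker": 0.2}
example_person = {"age": 65.0, "current_smoker": 1.0}

example_risk = absolute_risk(
    baseline_survival=0.995,
    coefficients=example_coefficients,
    covariates=example_person,
    means=example_means,
)
```

In a screening setting, people whose calculated risk exceeds a chosen threshold would be the ones invited; the threshold itself is a policy decision that the abstract does not specify.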

Funding: Innovate UK (UK Research and Innovation).

Translation: For the Chinese translation of the abstract see Supplementary Materials section.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Cohort Studies
  • Early Detection of Cancer
  • Female
  • Humans
  • Lung
  • Lung Neoplasms* / diagnostic imaging
  • Lung Neoplasms* / epidemiology
  • Male
  • Prospective Studies
  • Retrospective Studies
  • Risk Assessment
  • Risk Factors