A practical model for the identification of congenital cataracts using machine learning

EBioMedicine. 2020 Jan:51:102621. doi: 10.1016/j.ebiom.2019.102621. Epub 2020 Jan 3.

Abstract

Background: Worldwide, approximately 1 in 33 newborns is affected by a congenital anomaly. We aimed to develop a practical model for identifying infants at high risk of congenital cataracts (CCs), which are the leading cause of avoidable childhood blindness.

Methods: This case-control study was performed at the Zhongshan Ophthalmic Center and involved 2005 subjects, including 1274 children with CCs and 731 healthy controls. The CC identification models were built from birth conditions, family medical history, and family environmental factors using the random forest (RF) and adaptive boosting methods, and were trained on 1129 CC cases and 609 healthy controls. The models were evaluated by internal 4-fold cross-validation and by external validation (145 CC cases and 122 healthy controls). They were also tested on 4 datasets with gradually reduced proportions of CC patients (bilateral cases) to validate their performance in an approximate simulation of a clinical environment with a relatively low disease prevalence.
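As a rough illustration of the evaluation design described in the Methods, the sketch below runs 4-fold cross-validation for a random forest and an adaptive boosting classifier and reports the AUC per model. It assumes scikit-learn is available and uses synthetic data from `make_classification` as a stand-in for the study's features; the sample size, feature count, and hyperparameters are illustrative assumptions, not the authors' settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption): NOT the study's cohort or features.
X, y = make_classification(
    n_samples=600,
    n_features=12,
    n_informative=5,
    weights=[0.35, 0.65],  # roughly more cases than controls, as in a case-control design
    random_state=0,
)

# Stratified 4-fold split keeps the case/control ratio similar in every fold.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

# Hyperparameters are illustrative assumptions.
models = {
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "adaboost": AdaBoostClassifier(n_estimators=100, random_state=0),
}

cv_auc = {}
for name, model in models.items():
    # One AUC value per held-out fold.
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    cv_auc[name] = scores
    print(f"{name}: mean AUC = {scores.mean():.3f} (std {scores.std():.3f})")
```

An external validation step would fit each model on the full training set and score it once on a separate held-out cohort, which the sketch omits for brevity.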

Findings: The CC identification models showed high discrimination in both the 4-fold cross-validation (area under the curve (AUC)=0.91 [95% confidence interval: 0.88-0.94] in bilateral cases; 0.82 [0.77-0.89] in unilateral cases) and external validation (AUC=0.93±0.05 in bilateral cases; 0.86±0.01 in unilateral cases), and achieved stable performance in the clinical tests (AUC=0.94-0.96 in the four subgroups by RF). Furthermore, family history of CC, low parental education level, and comorbidity were identified as the three factors most relevant to both bilateral and unilateral CC diagnosis.
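The reduced-prevalence tests can be pictured with a small sketch: keep every control, subsample the cases to lower their share of the test set, and recompute the AUC each time. Because the AUC compares case scores against control scores pairwise, it is expected to stay roughly stable as prevalence falls. The scores below are simulated, and the four case fractions are hypothetical; the abstract does not state the proportions the authors used.

```python
import random


def auc(pos_scores, neg_scores):
    """AUC as the fraction of (case, control) pairs where the case scores higher.

    Ties count as half a win (the Mann-Whitney formulation).
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


rng = random.Random(0)

# Simulated model scores (assumption): NOT output from the study's models.
case_scores = [rng.gauss(2.0, 1.0) for _ in range(400)]
control_scores = [rng.gauss(0.0, 1.0) for _ in range(400)]

auc_by_prevalence = {}
for case_fraction in (0.5, 0.25, 0.1, 0.05):  # hypothetical proportions
    # Keep all controls; pick n_cases so that cases / (cases + controls) ~= case_fraction.
    n_cases = round(case_fraction * len(control_scores) / (1 - case_fraction))
    subset = rng.sample(case_scores, n_cases)
    auc_by_prevalence[case_fraction] = auc(subset, control_scores)
    print(f"case fraction {case_fraction:.2f}: n_cases={n_cases}, "
          f"AUC={auc_by_prevalence[case_fraction]:.3f}")
```

Metrics that depend on prevalence, such as positive predictive value, would change across these subsets even when the AUC does not, so a screening evaluation would normally report them as well.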

Interpretation: Our CC identification models can accurately discriminate CC patients from healthy children and have the potential to serve as a complementary screening procedure, especially in underdeveloped and remote areas.

Keywords: Congenital anomaly; Congenital cataract; Identification model; Machine learning.

MeSH terms

  • Algorithms
  • Area Under Curve
  • Case-Control Studies
  • Cataract / congenital*
  • Cataract / diagnosis*
  • Cataract / genetics
  • Child, Preschool
  • Female
  • Humans
  • Inheritance Patterns / genetics
  • Machine Learning*
  • Male
  • Models, Biological*
  • ROC Curve
  • Reproducibility of Results
  • Risk Factors