[Development and evaluation of a machine learning prediction model for large for gestational age]

Zhonghua Liu Xing Bing Xue Za Zhi. 2021 Dec 10;42(12):2143-2148. doi: 10.3760/cma.j.cn112338-20210824-00677.
[Article in Chinese]

Abstract

Objective: To develop and validate a model for predicting the risk of large for gestational age (LGA) during pregnancy using machine learning (ML) algorithms, and to compare its performance with that of a traditional logistic regression model. Methods: Data were obtained from the National Free Preconception Health Examination Project in China, which was carried out in 220 counties of 31 provinces from 2010 to 2012 and covered all rural couples planning a pregnancy. This study included all couples of childbearing age who had a singleton live birth at 24-42 weeks of gestation, together with their newborns. Ten ML algorithms were used to build LGA prediction models, and the predictive performance of each model was evaluated. Results: A total of 104 936 newborns were included: 54 856 boys (52.3%) and 50 080 girls (47.7%). The incidence of LGA was 11.7% (12 279 cases). After the class imbalance between LGA and non-LGA newborns was addressed by undersampling, the overall performance of the ML models improved markedly. The CatBoost model performed best, with an area under the receiver operating characteristic curve (AUC) of 0.932. The logistic regression model performed worst, with an AUC of 0.555. Conclusions: For predicting the risk of LGA in pregnancy, ML algorithms outperformed the traditional logistic regression method. Among the ML algorithms compared, CatBoost performed best and deserves further investigation.
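
The two techniques named in the abstract, undersampling and AUC, can be illustrated with a minimal sketch in plain Python. The abstract does not describe how the authors implemented either step, so the function names `undersample` and `roc_auc`, the random seed, and the toy data below are all assumptions made for illustration only. Undersampling is shown as random removal of majority-class records until both classes are the same size, and AUC is computed as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case.

```python
import random


def undersample(labels, seed=0):
    """Return sorted indices that keep every minority-class record and an
    equal-sized random subset of the majority class (hypothetical helper)."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Shrink whichever class is larger down to the size of the smaller one.
    if len(neg) > len(pos):
        neg = rng.sample(neg, len(pos))
    elif len(pos) > len(neg):
        pos = rng.sample(pos, len(neg))
    return sorted(pos + neg)


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive case has the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy labels with roughly the reported class ratio (about 12% positive).
labels = [1] * 12 + [0] * 88
kept = undersample(labels)
balanced = [labels[i] for i in kept]

# Four-record toy example: three of the four positive/negative pairs
# are ranked correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In this sketch `balanced` contains 12 positive and 12 negative labels, and `auc` is 0.75. A common practice, which the abstract neither confirms nor rules out for this study, is to undersample only the training data and to compute AUC on an untouched held-out set, since balancing the evaluation data changes the case mix the model is scored on.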

MeSH terms

  • Algorithms*
  • Female
  • Gestational Age
  • Humans
  • Infant
  • Infant, Newborn
  • Logistic Models
  • Machine Learning*
  • Male
  • Pregnancy
  • ROC Curve