Comparison and development of machine learning tools for the prediction of chronic obstructive pulmonary disease in the Chinese population

Xia Ma; Yanping Wu; Ling Zhang; Weilan Yuan; Li Yan; Sha Fan; Yunzhi Lian; Xia Zhu; Junhui Gao; Jiangman Zhao; Ping Zhang; Hui Tang; Weihua Jia

doi:10.1186/s12967-020-02312-0

J Transl Med. 2020 Mar 31;18(1):146. doi: 10.1186/s12967-020-02312-0.

Authors

Xia Ma^{1,2}, Yanping Wu³, Ling Zhang⁴, Weilan Yuan^{5,6}, Li Yan⁷, Sha Fan⁸, Yunzhi Lian⁹, Xia Zhu³, Junhui Gao^{5,6}, Jiangman Zhao^{5,6}, Ping Zhang¹⁰, Hui Tang^{11,12}, Weihua Jia¹³

Affiliations

¹ Department of Pulmonary and Critical Care Medicine, General Hospital of Datong Coal Mine Group Co., Ltd., Datong, 037000, China.
² Department of Pulmonary and Critical Care Medicine, The First Hospital of Shanxi Medical University, Taiyuan, 030001, China.
³ Department of Respiratory, General Hospital of Tisco (Sixth Hospital of Shanxi Medical University), 2 Yingxin Street, Jiancaoping District, Taiyuan, 030008, Shanxi Province, China.
⁴ Department of Respiratory, Linfen People's Hospital, Linfen, 041000, China.
⁵ Shanghai Biotecan Pharmaceuticals Co., Ltd., 180 Zhangheng Road, Shanghai, 201204, China.
⁶ Shanghai Zhangjiang Institute of Medical Innovation, Shanghai, 201204, China.
⁷ Department of Respiratory Medicine, Hebei General Hospital, Shijiazhuang, 050000, China.
⁸ Department of Respiratory Medicine, Heji Hospital Affiliated with Changzhi Medical College, Changzhi, 046011, China.
⁹ Department of Clinical Laboratory, JinCheng People's Hospital, Jincheng, 048000, China.
¹⁰ Department of Clinical Laboratory, Linfen People's Hospital, West of Rainbow Bridge, West Binhe Road, Yaodu District, Linfen, 041000, Shanxi Province, China. ping209@163.com.
¹¹ Shanghai Biotecan Pharmaceuticals Co., Ltd., 180 Zhangheng Road, Shanghai, 201204, China. tang11_23@126.com.
¹² Shanghai Zhangjiang Institute of Medical Innovation, Shanghai, 201204, China. tang11_23@126.com.
¹³ Department of Respiratory, General Hospital of Tisco (Sixth Hospital of Shanxi Medical University), 2 Yingxin Street, Jiancaoping District, Taiyuan, 030008, Shanxi Province, China. 1051569807@qq.com.

Abstract

Background: Chronic obstructive pulmonary disease (COPD) is a major public health problem and cause of mortality worldwide. However, COPD in the early stage is usually not recognized and diagnosed. It is necessary to establish a risk model to predict COPD development.

Methods: A total of 441 COPD patients and 192 control subjects were recruited, and 101 single-nucleotide polymorphisms (SNPs) were genotyped using the MassArray assay. Using 5 clinical features together with the SNPs, 6 predictive models were established and evaluated in the training set and test set by the confusion matrix, AU-ROC, AU-PRC, sensitivity (recall), specificity, accuracy, F1 score, MCC, PPV (precision) and NPV. The selected features were ranked by importance.
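
As an illustration of how the threshold-based metrics listed above relate to one another, the following minimal Python sketch derives them from the four cells of a binary confusion matrix. The function name `confusion_metrics` and the example counts are hypothetical and are not taken from the study; the formulas are the standard definitions.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Derive common binary-classification metrics from confusion-matrix counts.

    tp/fp/tn/fn are true positives, false positives, true negatives and
    false negatives. Assumes every denominator below is non-zero, except
    for MCC, which is set to 0.0 when its denominator is zero.
    """
    sensitivity = tp / (tp + fn)            # recall
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)                    # precision
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "accuracy": accuracy,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts, for illustration only (not data from the study).
metrics = confusion_metrics(tp=80, fp=10, tn=30, fn=8)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

AU-ROC and AU-PRC are not covered by this sketch because they are computed from the models' continuous scores across all thresholds rather than from a single confusion matrix.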

Results: Nine SNPs were significantly associated with COPD. Among them, 6 SNPs (rs1007052, OR = 1.671, P = 0.010; rs2910164, OR = 1.416, P < 0.037; rs473892, OR = 1.473, P < 0.044; rs161976, OR = 1.594, P < 0.044; rs159497, OR = 1.445, P < 0.045; and rs9296092, OR = 1.832, P < 0.045) were risk factors for COPD, while 3 SNPs (rs8192288, OR = 0.593, P < 0.015; rs20541, OR = 0.669, P < 0.018; and rs12922394, OR = 0.651, P < 0.022) were protective factors for COPD development. In the training set, KNN, LR, SVM, DT and XGBoost obtained AU-ROC values above 0.82 and AU-PRC values above 0.92. Among these models, XGBoost obtained the highest AU-ROC (0.94), AU-PRC (0.97), accuracy (0.91), precision (0.95), F1 score (0.94), MCC (0.77) and specificity (0.85), while MLP obtained the highest sensitivity (recall) (0.99) and NPV (0.87). In the validation set, KNN, LR and XGBoost obtained AU-ROC and AU-PRC values above 0.80 and 0.85, respectively. KNN had the highest precision (0.82), and KNN and LR tied for both the highest accuracy (0.81) and the highest F1 score (0.86). Both DT and MLP obtained sensitivity (recall) and NPV values above 0.94 and 0.84, respectively. In the feature importance analyses, AQCI, age and BMI had the greatest impact on the predictive abilities of the models, while SNPs, sex and smoking were less important.
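
To make the interpretation of the odds ratios above concrete (OR > 1 read as a risk factor, OR < 1 as a protective factor), the following minimal Python sketch computes an unadjusted OR with a Woolf-type 95% confidence interval from a 2 × 2 table of counts. The abstract does not state how the reported ORs were estimated, so this is an assumed textbook calculation rather than the authors' procedure, and the function name `odds_ratio` and the counts are hypothetical.

```python
import math


def odds_ratio(case_exposed, case_unexposed, ctrl_exposed, ctrl_unexposed, z=1.96):
    """Unadjusted odds ratio with a Woolf (log-based) confidence interval.

    "Exposed" here could mean, for example, carrying the minor allele.
    Assumes all four counts are positive; z=1.96 gives an approximate 95% CI.
    """
    or_value = (case_exposed * ctrl_unexposed) / (case_unexposed * ctrl_exposed)
    se_log_or = math.sqrt(
        1 / case_exposed + 1 / case_unexposed + 1 / ctrl_exposed + 1 / ctrl_unexposed
    )
    lower = math.exp(math.log(or_value) - z * se_log_or)
    upper = math.exp(math.log(or_value) + z * se_log_or)
    return or_value, lower, upper


# Hypothetical counts, for illustration only (not data from the study).
or_value, lower, upper = odds_ratio(30, 70, 20, 80)
direction = "risk" if or_value > 1 else "protective"
print(f"OR = {or_value:.3f} (95% CI {lower:.3f} to {upper:.3f}), direction: {direction}")
```

An interval that includes 1 would mean the association is not statistically significant at the corresponding level, regardless of the direction of the point estimate.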

Conclusions: The KNN, LR and XGBoost models showed excellent overall predictive power, and machine learning tools that combine clinical and SNP features were suitable for predicting the risk of COPD development.

Keywords: AQCI; Allele frequencies; COPD; Machine learning tools; SNP.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

China
Humans
Machine Learning*
Polymorphism, Single Nucleotide / genetics
Pulmonary Disease, Chronic Obstructive* / diagnosis
Pulmonary Disease, Chronic Obstructive* / genetics