Predicting disease risks from highly imbalanced data using random forest
- PMID: 21801360
- PMCID: PMC3163175
- DOI: 10.1186/1472-6947-11-51
Abstract
Background: We present a method that uses the Healthcare Cost and Utilization Project (HCUP) dataset to predict the disease risk of individuals from their medical diagnosis history. The methodology may be incorporated into a variety of applications, such as risk management, tailored health communication and decision support systems in healthcare.
Methods: We used the National Inpatient Sample (NIS) data, which is publicly available through HCUP, to train random forest (RF) classifiers for disease prediction. Because the HCUP data is highly imbalanced, we employed an ensemble learning approach based on repeated random sub-sampling. This technique divides the training data into multiple sub-samples while ensuring that each sub-sample is fully balanced. We compared the performance of support vector machine (SVM), bagging, boosting and RF in predicting the risk of eight chronic diseases.
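The balanced sub-sampling step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `balanced_subsamples` and `ensemble_score`, the choice to pair every minority-class record with an equal-sized random draw from the majority class, and the averaging of per-model scores are all assumptions, since the abstract does not specify these details.

```python
import random
from typing import Callable, List, Sequence


def balanced_subsamples(labels: Sequence[int], n_subsamples: int,
                        seed: int = 0) -> List[List[int]]:
    """Return index lists that each hold equal numbers of both classes.

    Assumption: every sub-sample keeps all minority-class records and adds
    a random draw (without replacement) of the same size from the majority.
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    # The shorter index list is the minority class.
    minority, majority = sorted((positives, negatives), key=len)
    subsamples = []
    for _ in range(n_subsamples):
        drawn = rng.sample(majority, len(minority))
        subsamples.append(minority + drawn)
    return subsamples


def ensemble_score(models: Sequence[Callable[[Sequence[float]], float]],
                   x: Sequence[float]) -> float:
    """Average the risk scores of the models trained on each sub-sample.

    Each model is any callable that maps a feature vector to a score,
    for example a wrapper around a trained random forest (hypothetical).
    """
    return sum(model(x) for model in models) / len(models)


# Toy data: 5 positive and 95 negative records.
labels = [1] * 5 + [0] * 95
subsamples = balanced_subsamples(labels, n_subsamples=10)
```

In this sketch, one classifier would be trained per index list in `subsamples`, and `ensemble_score` would combine their outputs at prediction time.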
Results: We predicted eight disease categories. Overall, the RF ensemble learning method outperformed SVM, bagging and boosting in terms of the area under the receiver operating characteristic (ROC) curve (AUC). In addition, RF has the advantage of computing the importance of each variable in the classification process.
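For readers unfamiliar with the AUC metric used in this comparison, the sketch below computes it through its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name `roc_auc` and the toy inputs are illustrative assumptions and are not taken from the paper.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ordered correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise loop is quadratic in the number of records, so it is only suitable for small illustrations; a sort-based computation would be preferable on data of NIS scale.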
Conclusions: By combining repeated random sub-sampling with RF, we were able to overcome the class imbalance problem and achieve promising results. Using the national HCUP dataset, we predicted eight disease categories with an average AUC of 88.79%.
© 2011 Khalilia et al; licensee BioMed Central Ltd.
