BMC Med Inform Decis Mak. 2011 Jul 29:11:51.
doi: 10.1186/1472-6947-11-51.

Predicting disease risks from highly imbalanced data using random forest

Free PMC article

Mohammed Khalilia et al. BMC Med Inform Decis Mak. 2011.

Abstract

Background: We present a method that uses the Healthcare Cost and Utilization Project (HCUP) dataset to predict the disease risk of individuals based on their medical diagnosis history. The methodology may be incorporated into a variety of applications, such as risk management, tailored health communication and decision support systems in healthcare.

Methods: We employed the National Inpatient Sample (NIS) data, which is publicly available through the Healthcare Cost and Utilization Project (HCUP), to train random forest (RF) classifiers for disease prediction. Since the HCUP data is highly imbalanced, we employed an ensemble learning approach based on repeated random sub-sampling. This technique divides the training data into multiple sub-samples, while ensuring that each sub-sample is fully balanced. We compared the performance of support vector machine (SVM), bagging, boosting and RF in predicting the risk of eight chronic diseases.
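The sub-sampling step described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `balanced_subsamples` and `average_votes` are hypothetical, and it assumes that each sub-sample keeps every minority-class record and adds an equal number of majority-class records drawn at random, which is one common way to obtain fully balanced sub-samples. In the approach described above, one classifier would be trained on each sub-sample and their predictions combined.

```python
import random


def balanced_subsamples(labels, n_subsamples, seed=0):
    """Return index lists, each holding equally many records of both classes.

    Assumption (not stated in the abstract): every sub-sample keeps all
    minority-class records and adds a random draw of the same size from
    the majority class.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    if not minority:
        raise ValueError("both classes must be present")
    subsamples = []
    for _ in range(n_subsamples):
        # Draw without replacement inside one sub-sample; draws may
        # overlap across different sub-samples.
        drawn = rng.sample(majority, len(minority))
        subsamples.append(sorted(minority + drawn))
    return subsamples


def average_votes(predictions):
    """Combine the 0/1 predictions of several models into a score in [0, 1]."""
    return sum(predictions) / len(predictions)


# Toy labels, made up for illustration: 5 positive and 95 negative records.
labels = [1] * 5 + [0] * 95
subs = balanced_subsamples(labels, n_subsamples=10)
print(len(subs), len(subs[0]))
```

Each index list can be used to select the training rows for one classifier; `average_votes` then turns the per-classifier predictions for a single record into a risk score.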

Results: We predicted eight disease categories. Overall, the RF ensemble learning method outperformed SVM, bagging and boosting in terms of the area under the receiver operating characteristic (ROC) curve (AUC). In addition, RF has the advantage of computing the importance of each variable in the classification process.
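As a reminder of what the AUC measures, the sketch below computes it directly from its rank interpretation: the probability that a randomly chosen positive record receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `auc` and the toy scores are illustrative assumptions and do not come from the study.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts, over all (positive, negative) pairs, how often the positive
    record is scored higher; ties contribute 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up scores for six records, three per class.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(auc(scores, labels))
```

In this toy example 8 of the 9 positive-negative pairs are ranked correctly, so the AUC is 8/9. The pairwise loop is quadratic in the number of records; a sort-based rank computation would be preferable on data of realistic size.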

Conclusions: By combining repeated random sub-sampling with RF, we were able to overcome the class imbalance problem and achieve promising results. Using the national HCUP dataset, we predicted eight disease categories with an average AUC of 88.79%.

Figures

Figure 1
Disease codes and categories hierarchical relationship. This is a snapshot of the hierarchical relationship between the diseases and disease categories. For instance, disease category 49 (diabetes) has children that are represented by disease codes (ICD-9-CM).
Figure 2
Demographics of patients by age, race and sex for the HCUP data set.
Figure 3
Flow diagram of random forest and sub-sampling approach.
Figure 4
RF behaviour when the number of trees (ntree) varies. This plot shows how the sensitivity of RF changes with the number of trees (ntree). We varied ntree from 1 to 1001 in intervals of 25 and measured the sensitivity at every interval. Sensitivity ranged from 0.8457 at ntree = 1 to 0.8984 at ntree = 726. In our experiments we used ntree = 500, since ntree did not have a large effect on accuracy for ntree > 1.
Figure 5
ROC curve for diabetes mellitus. ROC curve for diabetes mellitus comparing SVM, RF, boosting and bagging.
Figure 6
ROC curve for hypertension. ROC curve for hypertension comparing SVM, RF, boosting and bagging.
Figure 7
ROC curve for breast cancer. ROC curve for breast cancer comparing SVM, RF, boosting and bagging.
Figure 8
ROC curve for breast cancer (sampling vs. non-sampling). ROC curve for breast cancer comparing RF with the sampling and non-sampling approach.
Figure 9
ROC curve for other circulatory diseases (sampling vs. non-sampling). ROC curve for other circulatory diseases comparing RF with the sampling and non-sampling approach.
Figure 10
ROC curve for peripheral atherosclerosis (sampling vs. non-sampling). ROC curve for peripheral atherosclerosis comparing RF with the sampling and non-sampling approach.
