Breast cancer risk prediction in African women using Random Forest Classifier

Cancer Treat Res Commun. 2021;28:100396. doi: 10.1016/j.ctarc.2021.100396. Epub 2021 May 15.


Introduction: One of the most important steps in combating breast cancer is early and accurate diagnosis. Unfortunately, breast cancer is asymptomatic at the early stage. Symptoms present later, but by the symptomatic stage treatment can be complicated or even impossible, leading to death. Proper risk assessment is hence very important in reducing mortality. Some computational techniques have been developed for breast cancer risk assessment in the developed world, but such techniques do not work well in Africa because the risk profiles of African women differ, e.g. later menarche, low drug abuse and low smoking rates.

Aim: In this work, we propose a bespoke risk prediction model for African women using the Random Forest Classifier (RFC) machine learning technique.

Methods: A total of 180 subjects were studied, of whom 90 were confirmed cases of breast cancer and 90 were benign. Twenty-five risk factors were included, for example smoking, alcohol intake, occupational hazards and age at menopause. Four feature selection approaches were compared empirically: Chi-Square, mutual information gain, Spearman correlation and use of the entire feature set. The RFC algorithm was used to develop the prediction model.
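
The comparison described in the Methods could be sketched roughly as follows with scikit-learn. This is an illustrative outline only, not the authors' code: the data are synthetic stand-ins, and the number of selected features (k=6), the 60-subject hold-out split, the forest size, the use of absolute Spearman correlation with the outcome as a ranking score, and the helper names `spearman_score` and `fit_and_score` are all assumptions that the abstract does not specify.

```python
from functools import partial

import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (NOT the study data): 180 subjects, 25 coded,
# non-negative risk factors, 90 benign (0) and 90 malignant (1).
rng = np.random.default_rng(0)
n_subjects, n_factors = 180, 25
X = rng.integers(0, 3, size=(n_subjects, n_factors))
y = np.repeat([0, 1], n_subjects // 2)
X[:, :3] += y[:, None]  # make the first three factors informative

# Assumed hold-out split; the abstract does not state how the data were split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=60, stratify=y, random_state=0
)


def spearman_score(X, y):
    """Hypothetical ranking score: |Spearman rho| between each factor and y."""
    scores = []
    for j in range(X.shape[1]):
        rho, _ = spearmanr(X[:, j], y)
        scores.append(abs(rho))
    return np.array(scores)


def fit_and_score(selector):
    """Fit an optional selector followed by a random forest; return test accuracy."""
    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    steps = [forest] if selector is None else [selector, forest]
    model = make_pipeline(*steps)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


mutual_info = partial(mutual_info_classif, discrete_features=True, random_state=0)
results = {
    "all features": fit_and_score(None),
    "chi-square": fit_and_score(SelectKBest(chi2, k=6)),
    "mutual information": fit_and_score(SelectKBest(mutual_info, k=6)),
    "spearman": fit_and_score(SelectKBest(spearman_score, k=6)),
}

for name, accuracy in results.items():
    print(f"{name}: accuracy = {accuracy:.3f}")
```

Placing the selector inside the pipeline means feature scores are computed on the training portion only, which avoids leaking information from the held-out subjects into the selection step.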

Results: We found that family history of breast cancer, dense breast, deliberate abortion, age at first child, fruit intake and regular exercise are predictors of breast cancer. With all risk factors included, the RFC model gave an accuracy of 91.67%, sensitivity of 87.10%, specificity of 96.55% and area under the curve (AUC) of 92%. With correlation-selected features, it gave an accuracy of 96.67%, sensitivity of 93.75%, specificity of 100% and AUC of 97%. The Chi-Square-selected features gave the best performance, with 98.33% accuracy, 100% sensitivity, 96.55% specificity and 98% AUC. Mutual information gain-selected features gave the same results as the Chi-Square-selected features.
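
For readers less familiar with these metrics, the sketch below shows how accuracy, sensitivity and specificity follow from a confusion matrix. The abstract does not report the underlying counts; the counts used here are hypothetical values, chosen only because they are arithmetically consistent with the reported percentages for a 60-subject test set. AUC cannot be derived from a single confusion matrix and is therefore omitted.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


# Hypothetical counts (not reported in the abstract), one set per model.
scenarios = {
    "all features": classification_metrics(tp=27, fn=4, tn=28, fp=1),
    "correlation-selected": classification_metrics(tp=30, fn=2, tn=28, fp=0),
    "chi-square-selected": classification_metrics(tp=31, fn=0, tn=28, fp=1),
}

for name, metrics in scenarios.items():
    summary = ", ".join(f"{key} = {value:.2%}" for key, value in metrics.items())
    print(f"{name}: {summary}")
```

Under these assumed counts, the Chi-Square-selected model misclassifies a single benign case and no malignant cases, which is why its sensitivity reaches 100% while its specificity stays at 96.55%.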

Conclusion: The Random Forest Classifier has good potential for predicting the risk of breast cancer in African women. The study helped to identify the risk factors of breast cancer in African women. This is valuable information that can help African women pay attention to those risk factors, with the aim of reducing the incidence of breast cancer in Africa.

Keywords: African women; Breast cancer; Feature selection; Machine learning; Random forest; Risk prediction.

MeSH terms

  • Adult
  • Africa
  • Breast Neoplasms / epidemiology*
  • Female
  • Humans
  • Machine Learning*
  • Middle Aged
  • Risk Assessment / methods*
  • Risk Factors