Comparison of machine learning methods for estimating case fatality ratios: An Ebola outbreak simulation study

Alpha Forna; Ilaria Dorigatti; Pierre Nouvellet; Christl A Donnelly

doi:10.1371/journal.pone.0257005

PLoS One. 2021 Sep 15;16(9):e0257005. doi: 10.1371/journal.pone.0257005. eCollection 2021.

Authors

Alpha Forna¹, Ilaria Dorigatti², Pierre Nouvellet²,³, Christl A Donnelly²,⁴

Affiliations

¹ School of Computing Science, Simon Fraser University, Burnaby, British Columbia, Canada.
² Department of Infectious Disease Epidemiology, MRC Centre for Global Infectious Disease Analysis, Imperial College London, London, United Kingdom.
³ School of Life Sciences, University of Sussex, Brighton, United Kingdom.
⁴ Department of Statistics, University of Oxford, Oxford, United Kingdom.

Abstract

Background: Machine learning (ML) algorithms are increasingly used in infectious disease epidemiology. Epidemiologists should understand how ML algorithms behave in the context of outbreak data, where missing data are almost ubiquitous.

Methods: Using simulated data, we use an ML algorithmic framework to evaluate data imputation performance and the resulting case fatality ratio (CFR) estimates, focusing on the scale and type of data missingness (i.e., missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR)).
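The three missingness mechanisms can be illustrated with a minimal simulation sketch. This is not the authors' simulation framework: the single covariate (`hospitalised`), the death probabilities, and the missingness rates below are all hypothetical values chosen only to show how each mechanism affects a complete-case CFR estimate.

```python
import random


def simulate_cases(n, seed=0):
    """Simulate n cases with one binary covariate and a survival outcome.

    The covariate and the death probabilities are illustrative assumptions.
    """
    rng = random.Random(seed)
    cases = []
    for _ in range(n):
        hospitalised = rng.random() < 0.5
        p_death = 0.5 if hospitalised else 0.8  # assumed values
        cases.append({"hospitalised": hospitalised, "died": rng.random() < p_death})
    return cases


def mask_outcomes(cases, mechanism, rate=0.3, seed=1):
    """Return a copy of cases with some outcomes set to None.

    MCAR: missingness is independent of everything.
    MAR:  missingness depends only on the observed covariate.
    MNAR: missingness depends on the (unobserved) outcome itself.
    """
    rng = random.Random(seed)
    masked = []
    for case in cases:
        if mechanism == "MCAR":
            p_missing = rate
        elif mechanism == "MAR":
            p_missing = rate * 0.5 if case["hospitalised"] else rate * 1.5
        elif mechanism == "MNAR":
            p_missing = rate * 1.5 if case["died"] else rate * 0.5
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        outcome = None if rng.random() < p_missing else case["died"]
        masked.append({"hospitalised": case["hospitalised"], "died": outcome})
    return masked


def complete_case_cfr(cases):
    """CFR computed only from cases whose outcome is recorded."""
    observed = [c["died"] for c in cases if c["died"] is not None]
    return sum(observed) / len(observed)


if __name__ == "__main__":
    cases = simulate_cases(20000)
    print("true CFR:", round(complete_case_cfr(cases), 3))
    for mechanism in ("MCAR", "MAR", "MNAR"):
        masked = mask_outcomes(cases, mechanism)
        print(mechanism, round(complete_case_cfr(masked), 3))
```

Under these assumed settings, the complete-case estimate stays close to the true CFR under MCAR and drifts away from it under MAR and MNAR, which is the bias that imputation aims to reduce.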

Results: Across ML methods, dataset sizes and proportions of training data used, the area under the receiver operating characteristic curve decreased by 7% (median, range: 1%-16%) when missingness was increased from 10% to 40%. Overall reduction in CFR bias for MAR across methods, proportion of missingness, outbreak size and proportion of training data was 0.5% (median, range: 0%-11%).
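For readers less familiar with the metric, the area under the receiver operating characteristic curve (AUC) equals the probability that a randomly chosen positive case (here, a death) receives a higher predicted score than a randomly chosen negative case. The sketch below computes it directly from that definition; the function name and the 0/1 label coding are assumptions for illustration, not the evaluation code used in the study.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 outcomes (1 = death, an assumed coding).
    scores: predicted probabilities or scores, same length as labels.
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

An AUC of 0.5 corresponds to uninformative predictions and 1.0 to perfect separation. This pairwise form is quadratic in the number of cases, so it suits small illustrations rather than large datasets.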

Conclusion: ML methods could reduce bias and increase precision in CFR estimates at low levels of missingness. However, no method is robust to high percentages of missingness. Thus, a data-centric approach is recommended in outbreak settings: patient survival outcome data should be prioritised for collection, and random-sample follow-ups should be implemented to ascertain missing outcomes.
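One way a random-sample follow-up could feed into a CFR estimate is sketched below. This is an illustrative assumption rather than a procedure described in the abstract: the CFR among cases with missing outcomes is estimated from a simple random sample of them and combined with the recorded outcomes. The function and parameter names are hypothetical.

```python
def cfr_with_followup(deaths_observed, n_observed, n_missing,
                      followup_deaths, followup_n):
    """Combine recorded outcomes with a follow-up sample of missing outcomes.

    deaths_observed / n_observed: deaths and cases with a recorded outcome.
    n_missing: cases whose outcome was not recorded.
    followup_deaths / followup_n: deaths and cases in a random sample drawn
    from the missing-outcome cases and traced to ascertain their outcome.
    """
    if followup_n <= 0:
        raise ValueError("follow-up sample must contain at least one case")
    followup_cfr = followup_deaths / followup_n
    # Scale the follow-up CFR up to all cases with missing outcomes.
    estimated_missing_deaths = followup_cfr * n_missing
    return (deaths_observed + estimated_missing_deaths) / (n_observed + n_missing)
```

For example, with 250 deaths among 500 recorded outcomes, 500 missing outcomes, and 40 deaths in a follow-up sample of 50, the combined estimate is (250 + 0.8 × 500) / 1000 = 0.65, compared with 0.5 from the recorded outcomes alone. The estimate is only valid if the follow-up sample is genuinely random among the missing-outcome cases.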

Publication types

Comparative Study
Research Support, Non-U.S. Gov't

MeSH terms

Computer Simulation
Data Interpretation, Statistical
Datasets as Topic
Disease Outbreaks*
Hemorrhagic Fever, Ebola / epidemiology
Hemorrhagic Fever, Ebola / mortality*
Humans
Machine Learning*
Models, Statistical*
Research Design
Survival Analysis
