PLoS Negl Trop Dis. 2016 Mar 18;10(3):e0004549.
doi: 10.1371/journal.pntd.0004549. eCollection 2016 Mar.

Transforming Clinical Data Into Actionable Prognosis Models: Machine-Learning Framework and Field-Deployable App to Predict Outcome of Ebola Patients

Free PMC article

Andres Colubri et al. PLoS Negl Trop Dis. 2016.

Background: Assessment of the response to the 2014-15 Ebola outbreak indicates the need for innovations in data collection, sharing, and use to improve case detection and treatment. Here we introduce a Machine Learning pipeline for Ebola Virus Disease (EVD) prognosis prediction, which packages the best models into a mobile app to be available in clinical care settings. The pipeline was trained on a public EVD clinical dataset, from 106 patients in Sierra Leone.

Methods/principal findings: We used a new tool for exploratory analysis, Mirador, to identify the most informative clinical factors that correlate with EVD outcome. The small sample size and high prevalence of missing records were significant challenges. We applied multiple imputation and bootstrap sampling to address missing data and quantify overfitting. We trained several predictors over all combinations of covariates, which resulted in an ensemble of predictors, with and without viral load information, with an area under the receiver operator characteristic curve of 0.8 or more, after correcting for optimistic bias. We ranked the predictors by their F1-score, and those above a set threshold were compiled into a mobile app, Ebola CARE (Computational Assignment of Risk Estimates).

Conclusions/significance: This method demonstrates how to address small sample sizes and missing data, while creating predictive models that can be readily deployed to assist treatment in future outbreaks of EVD and other infectious diseases. By generating an ensemble of predictors instead of relying on a single model, we are able to handle situations where patient data is partially available. The prognosis app can be updated as new data become available, and we made all the computational protocols fully documented and open-sourced to encourage timely data sharing, independent validation, and development of better prediction models in outbreak response.

Conflict of interest statement

The authors have declared that no competing interests exist.


Fig 1
Fig 1. Case counts in the dataset.
The flowchart indicates the total number of confirmed EVD-positive cases in the original dataset, from which we took only those corresponding to patients between 10 and 50 years of age. Of those, only 65 have a known outcome and could be used for analysis. In the bottom part of the diagram, filled rectangles represent the numbers of cases among these 65 that include clinical chart (24), metabolic panel (47), and virus load data (58). The resulting missing-data pattern illustrates that only a few patients had known information across all categories.
Fig 2
Fig 2. Prognosis-prediction protocol and app.
The flowchart in panel A summarizes the prognosis-prediction protocol that could be used in the field by applying the models obtained in this study. Incoming patients are classified as either low or high risk depending on their age and the output of the best predictor suitable for the available clinical symptoms; any patient under 10 or above 50 years of age is considered high risk. For patients in the 10–50 age range, the best predictor that includes the clinical symptoms of the patient, among those predictors with a mean F1-score above 0.9, is selected to make a risk prediction. The Ebola CARE app (panel B) implements this protocol in an easy-to-use interface, where the health care worker can enter the patient’s clinical information. The app automatically chooses the best model for the data and returns the risk estimate.
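The selection logic described in this caption can be sketched in a few lines of Python. This is only an illustrative sketch, not the Ebola CARE implementation: the `Predictor` class, the `triage` function, the 0.5 risk cutoff, and the choice to return `None` when no predictor fits the available data are all assumptions made for the example. Only the age rule (under 10 or over 50 is high risk) and the mean F1-score threshold of 0.9 come from the caption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class Predictor:
    """Hypothetical container for one trained model in the ensemble."""
    name: str
    inputs: FrozenSet[str]                        # variables the model needs
    mean_f1: float                                # mean F1-score over test runs
    predict: Callable[[Dict[str, float]], float]  # returns a risk score in [0, 1]


def triage(age: float,
           observed: Dict[str, float],
           ensemble: List[Predictor],
           f1_cutoff: float = 0.9,
           risk_cutoff: float = 0.5) -> Optional[str]:
    """Return "high", "low", or None if no suitable predictor exists."""
    # Age rule from the caption: outside the 10-50 range is always high risk.
    if age < 10 or age > 50:
        return "high"
    # Keep predictors above the F1 threshold whose inputs are all available.
    usable = [p for p in ensemble
              if p.mean_f1 > f1_cutoff and p.inputs.issubset(observed)]
    if not usable:
        return None  # assumption: the sketch makes no prediction in this case
    best = max(usable, key=lambda p: p.mean_f1)
    # Assumption: a fixed cutoff turns the model's score into a risk label.
    return "high" if best.predict(observed) >= risk_cutoff else "low"
```

Because each predictor declares the inputs it needs, a patient with only a temperature reading is still matched to a model, which is the practical benefit of keeping an ensemble rather than a single model.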
Fig 3
Fig 3. The ten variables that have the highest Mutual Information content with EVD outcome, as ranked with MIC.
This set of 10 variables includes virus load (PCR), temperature, aspartate aminotransferase (AST), alkaline phosphatase (ALK), alanine aminotransferase (ALT), creatinine (CRE), total carbon dioxide (tCO2), heart rate, diarrhea, weakness, and vomiting. The plot in panel A shows the eikosograms for all 10 variables, generated from the clinical records of 65 EVD patients between 10 and 50 years of age with known outcome. An eikosogram is a plot that represents the conditional probabilities of one variable (in this case outcome) as a function of the conditioning variable. Staircase shapes in an eikosogram indicate association. The plot in panel B shows the ranking of the 10 variables by their MIC score with outcome.
Fig 4
Fig 4. Summary of all models generated with and without PCR data.
Each point in scatter plots A and C represents a predictive model, defined by a particular selection of input variables and a prediction algorithm (LR, ANN, DT, or SVM), trained and tested 100 times. The mean F1-score (harmonic mean of precision and sensitivity) calculated over the 100 testing iterations is shown on the horizontal axis, while the standard deviation of the F1-score is shown on the vertical axis. The size of each point is proportional to the number of input variables. The bar plots on the right (panels B and D) show the number of times each variable appears in a predictor with mean F1-score above 0.9. Panels A and B represent the models that include PCR data, while C and D represent those without.
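As a rough illustration of how each point in these scatter plots could be computed, the sketch below calculates the F1-score (the harmonic mean of precision and sensitivity) for each test run and then summarizes the runs by their mean and standard deviation. The function names are hypothetical, and the use of the population standard deviation is an assumption; the caption does not say which estimator the authors used.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 = 2 * precision * sensitivity / (precision + sensitivity)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # convention used here when there are no true positives
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    return 2 * precision * sensitivity / (precision + sensitivity)


def summarize_runs(runs: List[Tuple[Sequence[int], Sequence[int]]]) -> Tuple[float, float]:
    """Mean and standard deviation of F1 over (y_true, y_pred) test runs."""
    scores = [f1_score(y_true, y_pred) for y_true, y_pred in runs]
    return mean(scores), pstdev(scores)
```

A model with a high mean and a low standard deviation sits in the lower right of the scatter plot, which is where the most reliable predictors would be expected.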
Fig 5
Fig 5. Optimistic-bias estimation.
The optimistic bias for the AUC scores of all top PCR (a) and non-PCR (b) predictors was estimated using the bootstrap sampling method, averaging the difference between the AUC on the original data and on the bootstrap samples over 100 iterations. The scatter plots show the original AUC score for each model on the horizontal axis, the mean bias on the vertical axis, and the standard deviation of the bias as the error bar. Panels (c) and (d) show the optimistic bias as a function of the number of imputed copies, for a logistic regression model that results from applying backward variable selection to the PCR (c) and non-PCR (d) sets of variables. The backward selection algorithm was run 10 times for each number of imputed copies, and the mean bias over the 10 iterations is presented, with the standard deviation as the error bars. The bias is quite large when only one imputation is computed, but it decreases exponentially towards 0.01 as the number of imputed copies increases. The red lines in all plots represent least-squares fitted curves, using a linear function in (a, b) and an exponential curve in (c, d), highlighting how the optimistic bias depends on the AUC and on the number of imputed copies.
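One common way to estimate this kind of optimistic bias is the bootstrap optimism procedure: refit the model on each bootstrap sample, then average the gap between its AUC on that sample and its AUC on the original data. The sketch below follows that general recipe and is not taken from the authors' code. The names `auc`, `optimism`, and `toy_fit` are hypothetical, and `toy_fit` is a deliberately trivial one-feature scorer standing in for the real models (LR, ANN, DT, SVM).

```python
import random
from statistics import mean
from typing import Callable, List, Sequence

Scorer = Callable[[Sequence[float]], float]


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def optimism(X: List[Sequence[float]], y: List[int],
             fit: Callable[[List[Sequence[float]], List[int]], Scorer],
             n_boot: int = 100, seed: int = 0) -> float:
    """Mean of (AUC on bootstrap sample - AUC on original data) over refits."""
    if len(set(y)) < 2:
        raise ValueError("both outcome classes are required")
    rng = random.Random(seed)
    n = len(y)
    gaps = []
    while len(gaps) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        yb = [y[i] for i in idx]
        if len(set(yb)) < 2:
            continue  # skip degenerate resamples containing a single class
        Xb = [X[i] for i in idx]
        score = fit(Xb, yb)  # refit on the bootstrap sample
        gaps.append(auc(yb, [score(x) for x in Xb]) - auc(y, [score(x) for x in X]))
    return mean(gaps)


def toy_fit(X: List[Sequence[float]], y: List[int]) -> Scorer:
    """Toy stand-in model: scores by the first feature, with a learned sign."""
    pos = [x[0] for x, t in zip(X, y) if t == 1]
    neg = [x[0] for x, t in zip(X, y) if t == 0]
    sign = 1.0 if mean(pos) >= mean(neg) else -1.0
    return lambda x: sign * x[0]
```

Under these assumptions, a bias-corrected AUC would be the AUC of the model fitted and scored on the full data minus the value returned by `optimism`.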
