Review
Lancet Microbe. 2022 Aug;3(8):e625-e637.
doi: 10.1016/S2666-5247(21)00245-7. Epub 2022 Jan 10.

Optimising predictive models to prioritise viral discovery in zoonotic reservoirs

Daniel J Becker et al. Lancet Microbe. 2022 Aug.

Erratum in

Abstract

Despite the global investment in One Health disease surveillance, it remains difficult and costly to identify and monitor the wildlife reservoirs of novel zoonotic viruses. Statistical models can guide sampling target prioritisation, but the predictions from any given model might be highly uncertain; moreover, systematic model validation is rare, and the drivers of model performance are consequently under-documented. Here, we use the bat hosts of betacoronaviruses as a case study for the data-driven process of comparing and validating predictive models of probable reservoir hosts. In early 2020, we generated an ensemble of eight statistical models that predicted host-virus associations and developed priority sampling recommendations for potential bat reservoirs of betacoronaviruses and bridge hosts for SARS-CoV-2. During a time frame of more than a year, we tracked the discovery of 47 new bat hosts of betacoronaviruses, validated the initial predictions, and dynamically updated our analytical pipeline. We found that ecological trait-based models performed well at predicting these novel hosts, whereas network methods consistently performed approximately as well as, or worse than, expected at random. These findings illustrate the importance of ensemble modelling as a buffer against mixed-model quality and highlight the value of including host ecology in predictive models. Our revised models showed an improved performance compared with the initial ensemble, and predicted more than 400 bat species globally that could be undetected betacoronavirus hosts. We show, through systematic validation, that machine learning models can help to optimise wildlife sampling for undiscovered viruses and illustrate how such approaches are best implemented through a dynamic process of prediction, data collection, validation, and updating.

Conflict of interest statement

We declare no competing interests.

Figures

Figure 1
Agreement across an ensemble of predictive modelling approaches. Agreement across models identifying hosts with available virus data (in sample) (A) and without known viral associations (out of sample) (B). The pairwise Spearman's rank correlations between models' ranked species-level predictions were generally substantial and positive. Models were arranged in decreasing order of their mean correlation with other models. Models that used trait data made more similar predictions to each other than approaches using network methods with the same data. Network-based models that used some ecological data made more similar predictions than all other models (eg, network 4, which uses phylogeny, and hybrid 1, which uses both phylogeny and trait data). All models that could make out-of-sample predictions used trait data and showed strong agreement.
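The agreement measure described in this caption can be illustrated with a minimal sketch: compute Spearman's rank correlation for every pair of models, then order the models by their mean correlation with the others. The model names and scores below are hypothetical toy values, not data from the study, and the helper functions are illustrative rather than the authors' code.

```python
from itertools import combinations
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def order_by_mean_agreement(predictions):
    """Sort model names by decreasing mean correlation with the other models."""
    names = list(predictions)
    rho = {pair: spearman(predictions[pair[0]], predictions[pair[1]])
           for pair in combinations(names, 2)}

    def mean_rho(name):
        return mean(r for pair, r in rho.items() if name in pair)

    return sorted(names, key=mean_rho, reverse=True)


# Hypothetical scores for the same five species under three made-up models.
toy = {
    "trait_a": [0.9, 0.7, 0.4, 0.2, 0.1],
    "trait_b": [0.8, 0.9, 0.3, 0.2, 0.1],
    "network_a": [0.1, 0.5, 0.9, 0.3, 0.7],
}
ordered = order_by_mean_agreement(toy)
print(ordered)
```

In this toy case the two trait-style models agree closely with each other and the network-style model sits last, which mirrors the qualitative pattern the caption reports without reproducing any of its values.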
Figure 2
Initial ensemble predictions of the geographical and evolutionary distribution of known and predicted bat hosts of betacoronaviruses. Known hosts of betacoronaviruses (A,B) are found worldwide, but particularly in southern Asia and southern Europe. Taxonomically, betacoronaviruses are less common in two superfamilies of the suborder Yangochiroptera, namely Noctilionoidea and Vespertilionoidea (clade 1). The predicted in-sample bat hosts (ie, those with any viral association records; C,D) tend to recapitulate observed geographical patterns of known hosts but with a higher concentration in the Neotropics. Similarly, taxonomic patterns reflect those of known betacoronavirus hosts. In contrast, the out-of-sample bat host predictions based on phylogeny and ecological traits (E,F) are mostly clustered in Myanmar, Vietnam, and southern China, with none in the Neotropics or North America. Predicted hosts are likewise more common in the Rhinolophidae (clade 2) and subfamilies of Old World bats (clade 5) and are rare in many Neotropical taxa (clades 1 and 7) and emballonurids (clades 3 and 4). In the phylogenies, bar height indicates betacoronavirus positivity (B) or predicted rank (D,F; higher values indicate lower proportional ranks). Colours indicate likelihood of clades to contain hosts identified through phylogenetic factorisation (red indicates clades more likely to contain hosts, blue indicates less likely hosts; appendix).
Figure 3
Measuring model performance with novel data. Performance is based on the comparison of total predicted prevalence (ie, what proportion of species are predicted hosts of betacoronaviruses) with the sensitivity measured from validation data (ie, how many of the 47 new species are correctly identified). The null expectation for a model with a random performance is that these should be equivalent, whereas a model with strong performance will have a sensitivity that exceeds that null expectation (grey line). (A) The training prevalence–test sensitivity curve is a novel diagnostic that is conceptually similar to the receiver–operator curve, in that the model is evaluated at each possible scaled rank threshold between 0 and 1. (B) The same analysis as shown in (A), but only showing the point estimate of positivity created by each model's internally calibrated threshold. For model-guided sampling, the best model would be one that predicts a low-to-medium positivity rate and has a disproportionately high sensitivity (ie, in the upper left corner). Both (A) and (B) show that the trait-based models (including the hybrid model) perform well, whereas the network-only models perform roughly at random or worse than random (ie, close to the line); the ensemble model, which includes all eight, performs similarly to the two best trait models and better than six of the eight component models.
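The prevalence-versus-sensitivity diagnostic described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions: each species has a scaled score in [0, 1], a higher score means a more likely host, and a species counts as a predicted host when its score is at or above the threshold. The function name, species labels, and values are hypothetical and are not taken from the study.

```python
def prevalence_sensitivity_curve(scores, new_hosts, thresholds):
    """Return (threshold, predicted prevalence, sensitivity) at each threshold.

    scores: dict mapping species -> scaled score in [0, 1]
            (assumed: higher means more likely host).
    new_hosts: species confirmed as hosts after the predictions were made.
    """
    new_hosts = set(new_hosts)
    curve = []
    for t in thresholds:
        predicted = {sp for sp, s in scores.items() if s >= t}
        # Share of all species that the model flags as hosts.
        prevalence = len(predicted) / len(scores)
        # Share of newly confirmed hosts that the model had flagged.
        sensitivity = len(predicted & new_hosts) / len(new_hosts)
        curve.append((t, prevalence, sensitivity))
    return curve


# Hypothetical toy data: six species, two of them later confirmed as hosts.
toy_scores = {"sp1": 0.95, "sp2": 0.80, "sp3": 0.60,
              "sp4": 0.40, "sp5": 0.20, "sp6": 0.05}
toy_new_hosts = {"sp1", "sp3"}
curve = prevalence_sensitivity_curve(toy_scores, toy_new_hosts, [0.0, 0.5, 0.9])

for t, prevalence, sensitivity in curve:
    # A random model is expected to have sensitivity equal to prevalence.
    print(t, round(prevalence, 3), round(sensitivity, 3))
```

At the 0.5 threshold the toy model flags half of the species yet recovers both new hosts, so the point lies above the diagonal null line; a model whose points track the diagonal would be doing no better than chance.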
Figure 4
Comparing bat betacoronavirus host prediction with dynamic model updates. Scatterplots show bat species predictions from our original ensemble in 2020 against the revised predictions after updating models with 47 new hosts (A), and the final predictions from the weighted revised ensemble (B). Species are coloured by their status in the respective revised ensemble: unlikely host, a retained suspected host, a new betacoronavirus-positive host (new host), lost as a suspected host (lost), or a novel suspected host (gained). Trendlines show a linear regression fit between the original and revised predictions against a 1:1 line, whereas dashed lines display the threshold cutoffs from each ensemble. The top ten in-sample and out-of-sample predictions from the original (C) and final (D) ensemble are also listed. *Five of the original top ten in-sample predictions, and one of the top ten out-of-sample predictions, have been empirically confirmed since the first iteration of our study.
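The five status categories in this caption follow from comparing each species' original and revised predictions against the corresponding threshold cutoffs. The sketch below shows one way that bookkeeping could look. It assumes that a species is a suspected host when its score is at or above the cutoff and that newly confirmed hosts take precedence over the other categories; the function name, arguments, and numbers are hypothetical.

```python
def classify_status(original, revised, original_cutoff, revised_cutoff,
                    is_new_host):
    """Assign one of the five status labels used in the scatterplots.

    Assumes higher scores mean a more likely host and that a score at or
    above the cutoff marks a suspected host (illustrative convention only).
    """
    if is_new_host:
        return "new host"
    was_suspected = original >= original_cutoff
    is_suspected = revised >= revised_cutoff
    if was_suspected and is_suspected:
        return "retained"
    if was_suspected:
        return "lost"
    if is_suspected:
        return "gained"
    return "unlikely"


# Hypothetical species: (original score, revised score, newly confirmed?)
toy_species = {
    "sp_a": (0.8, 0.9, False),
    "sp_b": (0.7, 0.3, False),
    "sp_c": (0.2, 0.8, False),
    "sp_d": (0.1, 0.2, False),
    "sp_e": (0.6, 0.9, True),
}
statuses = {
    name: classify_status(orig, rev, 0.5, 0.6, new)
    for name, (orig, rev, new) in toy_species.items()
}
print(statuses)
```

Keeping separate cutoffs for the original and revised ensembles matches the caption's note that each ensemble has its own threshold.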
Figure 5
Updated ensemble model predictions of geographical and evolutionary hotspots of bat betacoronavirus hosts. (A) Geographical map of the weighted revised ensemble predictions. Most predicted undiscovered betacoronavirus hosts were found in sub-Saharan Africa and southeast Asia, especially in Malaysia and Borneo (and less so in the high-elevation mainland hotspots where most reservoirs of severe acute respiratory syndrome coronavirus-like viruses are found). (B) Phylogeny of the weighted revised ensemble predictions. Predicted hosts from this final ensemble were also most likely in the Rhinolophus genus (clade 7), several subclades of the Pteropodidae (clades 5 and 6), and the Old World Molossidae (clade 8), even though the Molossidae family as a whole had less likely hosts (clade 3). Bar height in the phylogeny indicates predicted rank, and colours indicate clades identified through phylogenetic factorisation (red indicates clades more likely to contain hosts, blue indicates clades less likely to contain hosts; appendix p 19).
Figure 6
Potential bridge hosts involved in SARS-CoV-2's emergence. Each dot represents predicted species-level sharing probabilities with Rhinolophus affinis (A) and R malayanus (B), estimated according to the phylogeographical viral sharing model trait-3. Each coloured point is a different mammal species. Black points and error bars denote the means and standard errors of viral sharing probability for each order; the mammal orders are arranged according to their mean sharing probability, ascending from left to right. The tables below report the top 15 predicted non-bat species for R affinis and R malayanus; several families are disproportionately represented, including pangolins (order, Pholidota; family, Manidae), mustelids (order, Carnivora; family, Mustelidae), and civets (order, Carnivora; family, Viverridae). Notable species are bolded (ordered based on immediate relevance to possible origins): (a) the wild boar S scrofa and palm civet P larvata were both traded in wildlife markets in Wuhan, China, before the pandemic; as were (b) close relatives of the greater hog badger, A collaris, (the northern hog badger, A albogularis), and of the mountain weasel, M altaica, and Malayan weasel, M nudipes (the Siberian weasel, M sibirica). (c) SARS-CoV-2-like viruses have been found in traded Sunda pangolins (M javanica) outside of Wuhan, China, though the species was not reported in Wuhan. (d) The ferret badger (M personata) was also reportedly of interest in WHO's origins investigation, which explored the role of wildlife farm supply chains.
