Nat Commun. 2022 Jul 6;13(1):3896.
doi: 10.1038/s41467-022-31511-0.

Deep learning from phylogenies to uncover the epidemiological dynamics of outbreaks

J Voznica et al. Nat Commun. 2022.

Abstract

Widely applicable, accurate and fast inference methods in phylodynamics are needed to fully profit from the richness of genetic data in uncovering the dynamics of epidemics. Standard methods, including maximum-likelihood and Bayesian approaches, generally rely on complex mathematical formulae and approximations, and do not scale with dataset size. We develop a likelihood-free, simulation-based approach, which combines deep learning with (1) a large set of summary statistics measured on phylogenies or (2) a complete and compact representation of trees, which avoids potential limitations of summary statistics and applies to any phylodynamics model. Our method enables both model selection and estimation of epidemiological parameters from very large phylogenies. We demonstrate its speed and accuracy on simulated data, where it performs better than the state-of-the-art methods. To illustrate its applicability, we assess the dynamics induced by superspreading individuals in an HIV dataset of men-having-sex-with-men in Zurich. Our tool PhyloDeep is available on github.com/evolbioinfo/phylodeep .

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Birth-death models.
a Birth-death model (BD), b birth-death model with Exposed-Infectious individuals (BDEI), and c birth-death model with SuperSpreading (BDSS). BD is the simplest generative model, used to estimate R0 and the infectious period (1/γ). BDEI and BDSS are extended versions of BD. BDEI makes it possible to estimate the latency period (1/ε), during which individuals of the exposed class E are infected but not yet infectious. BDSS includes two populations with heterogeneous infectiousness: the so-called superspreading individuals (S) and normal spreaders (N). Superspreading individuals make up only a small fraction of the population (fss) and may transmit the disease at a rate several times higher than that of normal spreaders (rate ratio = Xss). Superspreading can have various complex causes, such as heterogeneity of immune response, disease progression, co-infection with other diseases, social contact patterns or risk behaviour. Infectious individuals I (superspreading infectious individuals IS and normal spreaders IN for BDSS) transmit the disease at rate β (βX,Y for an individual of type X transmitting to an individual of type Y for BDSS), giving rise to a newly infected individual. The newly infected individual is either infectious right away (BD and BDSS) or goes through an exposed state before becoming infectious at rate ε (BDEI). Infectious individuals are removed at rate γ. Upon removal, they are sampled with probability s and move to the removed sampled class R. If not sampled upon removal, they move to the non-infectious unsampled class U.
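The BD dynamics described in this caption can be illustrated with a minimal sketch. It is not the simulator used by the authors: it only tracks class counts with a Gillespie-style event loop and does not build a tree. The function names, the `max_events` cap and the seed are assumptions made for illustration, and R0 = β/γ is the usual relation for this model rather than a formula stated in the caption.

```python
import random


def basic_reproduction_number(beta, gamma):
    """R0 under the usual BD relation R0 = beta / gamma (assumed here)."""
    return beta / gamma


def simulate_bd_counts(beta, gamma, s, max_events=10_000, seed=0):
    """Simulate class counts under BD: I (infectious), R (sampled), U (unsampled).

    Hypothetical helper: each infectious individual transmits at rate beta and
    is removed at rate gamma; on removal it is sampled with probability s.
    """
    rng = random.Random(seed)
    infectious, sampled, unsampled, transmissions = 1, 0, 0, 0
    time = 0.0
    for _ in range(max_events):
        if infectious == 0:
            break  # the outbreak died out
        # Waiting time to the next event, given all infectious individuals.
        time += rng.expovariate((beta + gamma) * infectious)
        if rng.random() < beta / (beta + gamma):
            infectious += 1        # transmission: one newly infected individual
            transmissions += 1
        else:
            infectious -= 1        # removal at rate gamma
            if rng.random() < s:
                sampled += 1       # removed sampled class R
            else:
                unsampled += 1     # non-infectious unsampled class U
    return {
        "time": time,
        "I": infectious,
        "R": sampled,
        "U": unsampled,
        "transmissions": transmissions,
    }
```

For example, `basic_reproduction_number(0.5, 0.25)` gives an R0 of 2 with an infectious period of 1/γ = 4 time units. BDEI and BDSS would add an exposed class or a second infectious class to the same loop.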
Fig. 2. Pipeline for training neural networks on phylogenies.
Tree representations: a (i) simulated binary trees. Under each model from Fig. 1, we simulate many trees of variable size (50 to 200 tips for ‘small trees’ and 200 to 500 tips for ‘large trees’). For illustration, we show here a tree with 5 tips. We encode the simulations into two representations: either a (ii–v) a complete and compact tree representation called ‘Compact Bijective Ladderized Vector’ (CBLV), or a (vi) summary statistics (SS). CBLV is obtained through a (ii) ladderization, that is, sorting of internal nodes so that the branch supporting the most recent leaf is always on the left, and a (iii) an inorder tree traversal, during which we append to a real-valued vector, for each visited internal node, its distance to the root and, for each visited tip, its distance to the previously visited internal node. We reshape this representation into a (iv) an input matrix in which the information on internal nodes and leaves is separated into two rows. Finally, a (v) we pad this matrix with zeros so that the matrices for all simulations have the size of the largest simulation matrix. For illustration purposes, we assume here that the maximum tree size covered by simulations is 10, and the representation is thus padded with zeros accordingly. SS consists of a (vi) a set of 98 statistics: 83 published in Saulnier et al., 14 on transmission chains and 1 on tree size. The information on sampling probability is added to both representations. b Neural networks are trained on these representations to estimate parameter values or to select the underlying model. For SS, we use b (i) a deep feed-forward neural network (FFNN) of funnel shape (we show the number of neurons above each layer). For the CBLV representation, we train b (ii) convolutional neural networks (CNN). The CNN is added on top of the FFNN. The CNN combines convolutional, maximum pooling and global average pooling layers, as described in detail in ‘Methods’ and Supplementary Information.
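The CBLV steps a (ii) to a (v) can be sketched as follows. This is a simplified illustration written from the caption alone, not the PhyloDeep implementation. The `Node` class and the function names are hypothetical, "most recent leaf" is taken to mean the tip farthest from the root, and the first visited tip, which has no previously visited internal node, is assumed to be encoded by its distance to the root. The sampling probability that the caption says is appended to the representation is left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical binary tree node; tips have no children."""
    length: float = 0.0               # branch length to the parent
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def max_tip_depth(node: Node, depth: float = 0.0) -> float:
    """Largest root-to-tip distance found below `node`."""
    depth += node.length
    if node.left is None:
        return depth
    return max(max_tip_depth(node.left, depth), max_tip_depth(node.right, depth))


def ladderize(node: Node, depth: float = 0.0) -> None:
    """Step (ii): put the child carrying the most recent tip on the left."""
    if node.left is None:
        return
    depth += node.length
    if max_tip_depth(node.right, depth) > max_tip_depth(node.left, depth):
        node.left, node.right = node.right, node.left
    ladderize(node.left, depth)
    ladderize(node.right, depth)


def encode_cblv(root: Node, max_tips: int) -> List[List[float]]:
    """Steps (ii)-(v): return [tip_row, internal_row], zero-padded to max_tips."""
    ladderize(root)
    tips: List[float] = []
    internal: List[float] = []
    last_internal_depth = 0.0  # assumption: the first tip is measured from the root

    def visit(node: Node, depth: float) -> None:
        nonlocal last_internal_depth
        depth += node.length
        if node.left is None:
            # Tip: distance to the previously visited internal node.
            tips.append(depth - last_internal_depth)
            return
        visit(node.left, depth)        # inorder: left subtree first,
        internal.append(depth)         # then the internal node (root distance),
        last_internal_depth = depth
        visit(node.right, depth)       # then the right subtree

    visit(root, 0.0)
    if len(tips) > max_tips:
        raise ValueError("tree has more tips than max_tips")
    tip_row = tips + [0.0] * (max_tips - len(tips))
    internal_row = internal + [0.0] * (max_tips - len(internal))
    return [tip_row, internal_row]
```

Because the tree is ladderized before traversal, two trees that differ only in the left/right order of their children produce the same matrix. The recursive `max_tip_depth` call makes this sketch quadratic in the worst case, which is fine for illustration but not for trees with thousands of tips.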
Fig. 3. Assessment of deep learning accuracy.
Comparison of inference accuracy by BEAST2 (in blue), deep neural network trained on SS (in orange) and convolutional neural network trained on the CBLV representation (in green) on 100 test trees. The size of training and testing trees was uniformly sampled between 200 and 500 tips. We show the relative error for each test tree. The error is measured as the normalized distance between the target value of each parameter and either the median a posteriori estimate by BEAST2 or the point estimate by the neural networks. Simulations for which BEAST2 did not converge, and whose values were thus set to the median of the parameter subspace used for simulations, are depicted as red squares. Analyses with a high relative error (>1.00) for one of the estimates are depicted as black diamonds. We compare the relative errors for a BD-simulated, b BDEI-simulated and c BDSS-simulated trees. The average relative error is displayed for each parameter and method in the corresponding colour below each figure. The average error of an FFNN trained on summary statistics but with randomly permuted targets is displayed as a black dashed line, and its value is shown in bold black below the x-axis. The accuracy of each method is compared by two-sided paired z-test; P < 0.05 is shown as a thick solid line; non-significant comparisons are not shown.
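The error measure in this caption can be written down as a short sketch. It assumes that "normalized distance" means the absolute difference between estimate and target divided by the target value; the function names are hypothetical and the caption does not give the exact formula.

```python
def relative_error(estimate, target):
    """Absolute difference between estimate and target, scaled by the target.

    Assumed normalization; the target must be non-zero.
    """
    return abs(estimate - target) / abs(target)


def mean_relative_error(estimates, targets):
    """Average relative error over a set of test trees for one parameter."""
    errors = [relative_error(e, t) for e, t in zip(estimates, targets)]
    return sum(errors) / len(errors)
```

Under this definition, an estimate of 1.5 for a target of 2.0 has a relative error of 0.25, and any estimate more than twice the target has a relative error above the 1.00 threshold that the caption marks with black diamonds.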
Fig. 4. Deep learning accuracy with ‘huge’ trees.
Comparison of inference accuracy by neural networks trained on large trees in predicting large trees (CNN-CBLV, in grey, same as in Fig. 3) and huge trees (FFNN-SS, in orange, and CBLV-NN, in pink) on 100 large and 100 huge test trees. The training and testing large trees are the same as in Fig. 3 (between 200 and 500 tips each). The huge testing trees were generated for the same parameters as the large training and testing trees, but their size varied between 5000 and 10,000 tips. We show the relative error for each test tree. The error is measured as the normalized distance between the point estimates by neural networks and the target values for each parameter. We compare the relative errors for a BD-simulated, b BDEI-simulated and c BDSS-simulated trees. Average relative error is displayed for each parameter and method in corresponding colour below each plot.
Fig. 5. Parameter inference on HIV data sampled from MSM in Zurich.
Using the BDSS model with BEAST2 (in blue), FFNN-SS (in orange), and CNN-CBLV (in green), we infer: a (i) basic reproduction number, a (ii) infectious period (in years), a (iii) superspreading transmission ratio, and a (iv) superspreading fraction. For FFNN-SS and CNN-CBLV, we show the posterior distributions and the 95% CIs obtained with a fast approximation of the parametric bootstrap (‘Methods’, Supplementary Information). For BEAST2, the posterior distributions and 95% CIs were obtained considering all reported steps (9000 in total) excluding the 10% burn-in. Arrows show the position of the original point estimates obtained with FFNN-SS and CNN-CBLV and the median a posteriori estimate obtained with BEAST2. Circles show the lower and upper boundaries of the 95% CIs. b These values are reported in a table, together with point estimates obtained while considering lower and higher sampling probabilities (0.20 and 0.30). c The 95% CI boundaries obtained with FFNN-SS are used to perform an a posteriori model adequacy check. We simulated 10,000 trees with BDSS while resampling each parameter from a uniform distribution whose lower and upper bounds were defined by the 95% CI. We then encoded these trees into SS, performed PCA and projected the SS obtained from the HIV MSM phylogeny (red stars) onto these PCA plots. We show here the projection onto c (i) the first two components of the PCA and c (ii) the 3rd and 4th components, together with the associated percentage of variance displayed in parentheses. Warm colours correspond to a high density of simulations.
