Mol Biol Evol. 2021 Aug 23;38(9):4010-4024.
doi: 10.1093/molbev/msab149.

Fundamental Identifiability Limits in Molecular Epidemiology

Stilianos Louca et al. Mol Biol Evol.

Abstract

Viral phylogenies provide crucial information on the spread of infectious diseases, and many studies fit mathematical models to phylogenetic data to estimate epidemiological parameters such as the effective reproduction ratio (Re) over time. Such phylodynamic inferences often complement or even substitute for conventional surveillance data, particularly when sampling is poor or delayed. It remains generally unknown, however, how robust phylodynamic epidemiological inferences are, especially when there is uncertainty regarding pathogen prevalence and sampling intensity. Here, we use recently developed mathematical techniques to fully characterize the information that can possibly be extracted from serially collected viral phylogenetic data, in the context of the commonly used birth-death-sampling model. We show that for any candidate epidemiological scenario, there exists a myriad of alternative, markedly different, and yet plausible "congruent" scenarios that cannot be distinguished using phylogenetic data alone, no matter how large the data set. In the absence of strong constraints or rate priors across the entire study period, neither maximum-likelihood fitting nor Bayesian inference can reliably reconstruct the true epidemiological dynamics from phylogenetic data alone; rather, estimators can only converge to the "congruence class" of the true dynamics. We propose concrete and feasible strategies for making more robust epidemiological inferences from viral phylogenetic data.

Keywords: birth-death-sampling model; epidemiology; phylogenetics; statistical inference.

Figures

Fig. 1.
Examples of congruent epidemiological scenarios. (A–F) Birth (or speciation) rate λ (A), death (or extinction) rate μ (B), sampling rate ψ (C), effective reproduction ratio Re=λ/(μ+ψ) (D), removal rate δ=μ+ψ (E), and sampling proportion S=ψ/(μ+ψ) (F) of a specific epidemiological scenario (thick black curves), compared with various alternative congruent (i.e., statistically indistinguishable) scenarios (dashed curves). Curves of the same color across subfigures A–F correspond to the same scenario. No viral phylogeny, no matter how large, could possibly distinguish between these (and in fact a myriad of other) scenarios. (G–L) Similar to A–F, but showing scenarios congruent to a different reference scenario. (M–R) Similar to A–F, but showing scenarios congruent to yet another reference scenario.
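The derived quantities named in this caption follow directly from the three primary rates. The snippet below is a minimal sketch of that mapping and its inverse at a single time point; the function names `derived_parameters` and `primary_rates` are hypothetical and are not taken from the paper or from any package.

```python
def derived_parameters(lam, mu, psi):
    """Map birth, death, and sampling rates to (Re, delta, S).

    Uses the definitions in the caption:
    Re = lam / (mu + psi), delta = mu + psi, S = psi / (mu + psi).
    """
    delta = mu + psi
    if delta <= 0:
        raise ValueError("removal rate mu + psi must be positive")
    return lam / delta, delta, psi / delta


def primary_rates(re, delta, s):
    """Invert the mapping: recover (lam, mu, psi) from (Re, delta, S)."""
    lam = re * delta          # from Re = lam / delta
    psi = s * delta           # from S = psi / delta
    mu = (1.0 - s) * delta    # from delta = mu + psi
    return lam, mu, psi


if __name__ == "__main__":
    re, delta, s = derived_parameters(lam=3.0, mu=1.0, psi=0.5)
    print(re, delta, s)
    print(primary_rates(re, delta, s))
```

Because the mapping is one-to-one whenever δ > 0, plotting (Re, δ, S) instead of (λ, μ, ψ) is only a change of coordinates; it does not by itself resolve the congruence between scenarios.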
Fig. 2.
Conceptual illustration of the effects of model congruencies on epidemiological reconstruction. (A) The large light-blue “balloon” represents the congruence class of a single true epidemiological scenario (dark-blue circle), in the space of all biologically plausible epidemiological scenarios. Each straight continuous line represents a limited set of models or functional forms (e.g., skyline) fitted to data generated by the true scenario, for example via maximum likelihood, in an effort to approximately reconstruct the true scenario. The specific member (i.e., with a specific parameterization) chosen among each model set will be the one closest to the congruence class (filled gray circle) or may even intersect the congruence class (filled red circle), but will not necessarily be the one closest to the true scenario (open circles). This issue persists even for infinitely large data sets. (B and C) Hypothetical illustration of a skyline model (gray curve) fitted to data generated by a hypothetical scenario (blue curve, here only showing λ). Although no practical model perfectly matches reality, in the absence of model congruencies one would nevertheless ideally expect to obtain a fit approximately resembling reality, roughly as shown in B. Instead, due to model congruencies, one can easily obtain a fit that very poorly resembles the true scenario, as in C, since the fit closest to the congruence class is not necessarily the fit closest to the true scenario. See figure 4 for a real example.
Fig. 3.
Limits to reconstructing an epidemic’s dynamics via maximum likelihood. (A–C) Maximum-likelihood estimates (grey dashed curves) of the effective reproduction ratio (Re), removal rate (δ=μ+ψ), and sampling proportion (S=ψ/(μ+ψ)) over time, based on a timetree with 175,440 tips simulated under a hypothetical BDS scenario (blue continuous curves). Rates are in day⁻¹. Model adequacy was confirmed via parametric bootstrapping with multiple test statistics. Observe the poor agreement between the estimated and true profiles. (D–F) Maximum-likelihood estimates of the dLTT curve (normalized to have unit area under the curve), deterministic branching density (β̃), and deterministic sampling density (σ̃), corresponding to the same fitted model as in A–C, compared with their true profiles. The good agreement between the inferred and true profiles shows that the fitted model converged toward the true epidemiological scenario’s congruence class but not the true scenario itself. (G–I) Maximum-likelihood estimates of Re, δ, and S inferred from the same data as in A–F, while fixing the sampling rate ψ to its true profile. For additional BDS parameters see supplementary figure S1, Supplementary Material online. For a statistical analysis of estimation accuracies across trees simulated from many random scenarios, see supplementary figures S17 and S19, Supplementary Material online.
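The caption states that the dLTT curves are normalized to have unit area under the curve. As a rough illustration of what that normalization involves, the sketch below rescales a curve sampled on a time grid so that its trapezoid-rule integral equals 1. The trapezoid rule and the function names are assumptions made for this example; the source does not say how the authors computed the area.

```python
def trapezoid_area(times, values):
    """Approximate the area under a sampled curve with the trapezoid rule."""
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least two matching (time, value) points")
    return sum(
        0.5 * (values[i] + values[i + 1]) * (times[i + 1] - times[i])
        for i in range(len(times) - 1)
    )


def normalize_to_unit_area(times, values):
    """Rescale values so that the area under the curve equals 1."""
    area = trapezoid_area(times, values)
    if area <= 0:
        raise ValueError("curve must enclose a positive area")
    return [v / area for v in values]


if __name__ == "__main__":
    t = [0.0, 1.0, 2.0]
    curve = [0.0, 2.0, 0.0]  # toy curve, not data from the paper
    scaled = normalize_to_unit_area(t, curve)
    print(scaled, trapezoid_area(t, scaled))
```

Normalizing in this way makes curves from trees of different sizes comparable in shape, which is what panels D–F rely on when comparing inferred and true profiles.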
Fig. 4.
Limits to reconstructing an epidemic’s dynamics in a Bayesian framework. (A–C) Posterior distributions of the effective reproduction ratio (Re=λ/(μ+ψ)), removal rate (δ=μ+ψ), and sampling proportion (S=ψ/(μ+ψ)), as inferred from 590 sequences simulated under a hypothetical BDS scenario (blue curves) using BEAST2. Black curves show posterior median, dark and light shades represent equal-tailed 50%- and 95%-credible intervals of the posterior. All rates are in yr⁻¹. The present-day sampling proportion was fixed to its true value during fitting to account for previously reported identifiability issues in skyline models (Stadler et al. 2013). Model adequacy was confirmed using predictive posterior simulations with multiple test statistics. Observe the poor agreement between the posterior predictions and the true profiles. For additional epidemiological parameters (λ, μ, ψ), see supplementary figure S5, Supplementary Material online. For the molecular evolution parameters, see supplementary figure S10, Supplementary Material online. (D–F) Distributions of the dLTT curves (normalized to have unit area under the curve), deterministic branching densities (β̃), and deterministic sampling densities (σ̃), corresponding to the same posterior models as in A–C, compared with their true profiles (blue curves). The relatively good agreement between the inferred and true profiles shows that BEAST2 closely reconstructed the true epidemiological history’s congruence class, but not the true epidemiological history itself. (G–I) Posterior distributions of Re, δ, and S inferred from the same data, while fixing the present-day sampling proportion and the removal rate’s profile to their true values. For additional parameters, see supplementary figures S6 and S11, Supplementary Material online.
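The posterior median and equal-tailed credible intervals described in this caption can be computed from posterior samples of a parameter at one time point. The following is a minimal sketch using linear interpolation between order statistics; the helper names are hypothetical, and this is not BEAST2's own summary code.

```python
def quantile(sorted_values, q):
    """Linearly interpolated quantile of an ascending list, with 0 <= q <= 1."""
    if not sorted_values:
        raise ValueError("need at least one sample")
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1.0 - frac) + sorted_values[upper] * frac


def equal_tailed_summary(samples, level=0.95):
    """Return (lower bound, median, upper bound) of an equal-tailed interval.

    An equal-tailed interval leaves probability (1 - level) / 2 in each tail.
    """
    ordered = sorted(samples)
    tail = (1.0 - level) / 2.0
    return (
        quantile(ordered, tail),
        quantile(ordered, 0.5),
        quantile(ordered, 1.0 - tail),
    )


if __name__ == "__main__":
    toy_samples = [float(i) for i in range(101)]  # toy values, not real output
    print(equal_tailed_summary(toy_samples, level=0.50))
    print(equal_tailed_summary(toy_samples, level=0.95))
```

Applying `equal_tailed_summary` separately at each time point, once with `level=0.50` and once with `level=0.95`, would give the kind of median curve and nested shaded bands the caption describes.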
Fig. 5.
Bayesian reconstruction of HIV spread is compromised by model congruencies. (A–D) Specified priors for BDS (skyline) model parameters of HIV-1 subtype B in Northern Alberta, reflecting our a priori knowledge of the plausible range of these parameters. (E–H) Distribution of BDS parameters over time, based on models sampled from the posterior distribution by BEAST2. At each time point, the black curve shows the median value of a parameter across all posterior-sampled models, whereas the dark and light shadings show 50% and 95% equal-tailed highest posterior density intervals, respectively. (I–L) Maximum posterior probability BDS “reference” scenario (continuous black curves) compared with multiple alternative “congruent” scenarios (dashed curves). Each congruent scenario would generate timetrees with the same probability distribution as the reference scenario and is thus statistically indistinguishable from the latter. For the posterior distributions of molecular evolution parameters, see supplementary figure S23, Supplementary Material online.

References

    Akaike H. 1981. Likelihood of a model and information criteria. J Econom. 16(1):3–14.
    Ayres DL, Darling A, Zwickl DJ, Beerli P, Holder MT, Lewis PO, Huelsenbeck JP, Ronquist F, Swofford DL, Cummings MP, et al. 2012. Beagle: an application programming interface and high-performance computing library for statistical phylogenetics. Syst Biol. 61(1):170–173.
    Bhaskar A, Song YS. 2014. Descartes’ rule of signs and the identifiability of population demographic models from genomic variation data. Ann Stat. 42(6):2469–2493.
    Boskova V, Bonhoeffer S, Stadler T. 2014. Inference of epidemiological dynamics based on simulated phylogenies using birth-death and coalescent models. PLoS Comput Biol. 10(11):e1003913.
    Bouckaert R, Vaughan TG, Barido-Sottani J, Duchêne S, Fourment M, Gavryushkina A, Heled J, Jones G, Kühnert D, De Maio N, et al. 2019. BEAST 2.5: an advanced software platform for Bayesian evolutionary analysis. PLoS Comput Biol. 15(4):e1006650.