Syst Biol. 2014 Sep;63(5):753-71.
doi: 10.1093/sysbio/syu039. Epub 2014 Jun 20.

Probabilistic graphical model representation in phylogenetics

Sebastian Höhna et al. Syst Biol. 2014 Sep.

Abstract

Recent years have seen a rapid expansion of the model space explored in statistical phylogenetics, emphasizing the need for new approaches to statistical model representation and software development. Clear communication and representation of the chosen model is crucial for: (i) reproducibility of an analysis, (ii) model development, and (iii) software design. Moreover, a unified, clear and understandable framework for model representation lowers the barrier for beginners and nonspecialists to grasp complex phylogenetic models, including their assumptions and parameter/variable dependencies. Graphical modeling is a unifying framework that has gained in popularity in the statistical literature in recent years. The core idea is to break complex models into conditionally independent distributions. The strength lies in the comprehensibility, flexibility, and adaptability of this formalism, and the large body of computational work based on it. Graphical models are well-suited to teach statistical models, to facilitate communication among phylogeneticists and in the development of generic software for simulation and statistical inference. Here, we provide an introduction to graphical models for phylogeneticists and extend the standard graphical model representation to the realm of phylogenetics. We introduce a new graphical model component, tree plates, to capture the changing structure of the subgraph corresponding to a phylogenetic tree. We describe a range of phylogenetic models using the graphical model framework and introduce modules to simplify the representation of standard components in large and complex models. Phylogenetic model graphs can be readily used in simulation, maximum likelihood inference, and Bayesian inference using, for example, Metropolis-Hastings or Gibbs sampling of the posterior distribution.

Figures

Figure 1.
The symbols for a visual representation of a graphical model. a) Solid squares represent constant nodes, which specify fixed-valued variables. b) Stochastic nodes are represented by solid circles. These variables correspond to random variables and may depend on other variables. c) Deterministic nodes (dotted circles) indicate variables that are determined by a specific function applied to another variable. They can be thought of as variable transformations. d) Observed states are placed in clamped stochastic nodes, represented by gray-shaded circles. e) Replication over a set of variables is indicated by enclosing the replicated nodes in a plate (dashed rectangle). f) We introduce replication over a structured tree topology using a tree plate. This is represented by the divided, dashed rectangle with rounded corners. The subsections of the tree plate demark the different classes of nodes of the tree. The tree topology orders the nodes in the tree plate and may be a constant node (as in this example) or a stochastic node (if the topology node is a solid circle).
Figure 2.
An explicit graphical model of the distribution of a binary trait. Descriptions of the objects have been added for pedagogical purposes. The presence or absence of the binary trait is assumed to follow a Bernoulli distribution with parameter p. This parameter is equal to the probability of the presence of the baculum in an independently sampled species. We place a Beta prior density on the Bernoulli distribution parameter, such that p∼Beta(α,β), where α = 1 and β = 1 are the shape parameters of the Beta distribution. This probability density is defined on the interval [0,1], thus 0 ≤ p ≤ 1.
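As a minimal sketch of the model in this caption, the Python snippet below draws p from a Beta(α, β) prior and then draws one Bernoulli(p) presence/absence state per species. The function name, the number of species, and the seed are illustrative assumptions and do not come from the paper.

```python
import random


def simulate_binary_trait(n_species, alpha=1.0, beta=1.0, seed=None):
    """Draw p ~ Beta(alpha, beta), then one Bernoulli(p) trait per species."""
    rng = random.Random(seed)
    # Constant nodes alpha and beta parameterize the prior on the stochastic node p.
    p = rng.betavariate(alpha, beta)
    # Each species' state (1: presence, 0: absence) is an independent Bernoulli(p) draw.
    traits = [1 if rng.random() < p else 0 for _ in range(n_species)]
    return p, traits


# Hypothetical usage: five species, fixed seed for repeatability.
p, traits = simulate_binary_trait(n_species=5, seed=1)
print(p, traits)
```

With α = β = 1 the prior is uniform on [0,1], so every value of p is equally likely before any data are observed.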
Figure 3.
The evolution of a single binary character represented as a phylogenetic graphical model. a) The phylogenetic relationships of the five mammalian species. The observed state of the character (1: presence or 0: absence of the baculum) is given for each species. Other states at the internal nodes represent the unknown ancestral state. The branches of the tree (1,…,8) are labeled and assigned a fixed length (l1,…,l8). b) The corresponding graphical model, in which the species tree topology is still evident. We represent the state for each node with generic notation: S1 is the presence/absence state for node 1. The clamped nodes, in grey, indicate observed states, whereas unobserved states for ancestral species are in white. Constant nodes indicate fixed/known branch lengths. Under this model, the state for the root of the tree (S9) is drawn from a Bernoulli distribution with probability p. A Beta prior is assigned to the parameter of the Bernoulli distribution so that p∼Beta(α,β), where the parameters of the Beta distribution are constant nodes and assigned fixed values. The states of the nodes descended from the root of the tree (S1,…,S8) are dependent on the equilibrium frequency parameter (θ) and their respective branch lengths (constant nodes l1,…,l8). A second Beta distribution is applied as a prior on the parameter θ, where θ∼Beta(x,y).
Figure 4.
A phylogenetic graphical model of N independently evolving binary characters. When sampling N different binary characters for each extant species, we assume that these characters are independent and identically distributed. Thus the model for each character is the same as in Figure 3b. Yet, the state for each character 1,…,N can be different. We use the plate notation to represent repetition over a vector of elements. In this figure, the dashed box and the iterator i indicate the replicated variables. Thus, the plate represents separate variables of binary character evolution for i in characters 1,2,3,…,N.
Figure 5.
Explicit graphical model representation of a GTR model with a fixed tree topology. Purely for convenience, we show rooted trees here to emphasize the similarity to the previous figures. The model of character evolution is a continuous time Markov model parameterized by an instantaneous rate matrix. The rate matrix Q is a deterministic variable computed by multiplying the base frequencies π by the exchangeability rates ε. A Dirichlet distribution is applied as the prior distribution on both the base frequencies π and the exchangeability rates ε. a) A GTR model with fixed branch lengths. b) A GTR model with estimated branch lengths. Each branch length is independent and identically distributed under an exponential distribution.
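The deterministic node Q described in this caption can be sketched in a few lines of Python. The off-diagonal entry for a change from i to j is taken as ε_ij·π_j, and each diagonal entry makes its row sum to zero. The ordering of the six exchangeabilities (AC, AG, AT, CG, CT, GT) and the rescaling to one expected substitution per unit time are common conventions assumed here, not details stated in the caption.

```python
def gtr_rate_matrix(pi, eps):
    """Build a GTR rate matrix from base frequencies and exchangeability rates.

    pi  : four base frequencies in the order A, C, G, T (summing to 1)
    eps : six exchangeability rates, assumed order AC, AG, AT, CG, CT, GT
    """
    pairs = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    q = [[0.0] * 4 for _ in range(4)]
    for (i, j), e in zip(pairs, eps):
        q[i][j] = e * pi[j]  # rate of change from base i to base j
        q[j][i] = e * pi[i]  # same exchangeability, other target frequency
    for i in range(4):
        # Diagonal entries make every row sum to zero.
        q[i][i] = -sum(q[i][j] for j in range(4) if j != i)
    # Assumed convention: rescale so the expected rate at stationarity is 1.
    scale = -sum(pi[i] * q[i][i] for i in range(4))
    return [[x / scale for x in row] for row in q]


# Hypothetical parameter values, as if drawn from the two Dirichlet priors.
pi = [0.1, 0.2, 0.3, 0.4]
eps = [0.1, 0.3, 0.1, 0.1, 0.3, 0.1]
Q = gtr_rate_matrix(pi, eps)
for row in Q:
    print([round(x, 4) for x in row])
```

Because Q is a pure function of π and ε, it carries no prior of its own, which is why the caption treats it as a deterministic rather than a stochastic node.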
Figure 6.
Simplified representation of the GTR model of Figure 5b using a tree plate. The tree plate, a big dashed box, divides the nodes into three classes: the root node, internal nodes and tip nodes. The character state variables are named Sij where i denotes the i-th node and j the j-th site. The root node does not have a parent node in the tree while the other nodes do. The internal nodes and the tip nodes depend on the ancestral states. The ancestral variable of node i is obtained using the parent indicator function p̃(i). Tip nodes are clamped and thus shaded. A tree topology is attached to the tree plate via the tree variable Ψ shown on the left. The tree variable informs the plate of the structure and if the tree variable changes, the structure of the resulting graph changes too.
Figure 7.
Top panel: module representation of different tree priors with Ψ as a pivot node (note that Ψ denotes a time tree with edge lengths here, not simply a topology): a) Yule process (Yule 1925), b) constant rate birth–death process (Nee et al. 1994), c) decreasing speciation rate birth–death process [speciation rate: λ·exp(−αt), e.g., in Höhna (2014)] and d) coalescent process (Kingman 1982). Bottom panel: different rate matrix modules with Q as a pivot node: e) Jukes–Cantor rate matrix where all exchangeability rates and all base frequencies are equal (Jukes and Cantor 1969), f) F81 rate matrix where all exchangeability rates are equal but the base frequencies are drawn from a Dirichlet distribution (Felsenstein 1981), g) T92 rate matrix with a parameter for the frequency of the GC content πGC and a transition-transversion rate (Tamura 1992) and h) HKY85 rate matrix with the base frequencies drawn from a Dirichlet distribution and an estimated transition-transversion rate (Hasegawa et al. 1985).
Figure 8.
The graphical model of Figure 6, a GTR + Γ model, represented in modular form. a) The model is broken into five different modules: Tree, Rate matrix, Site rates, Branch rates and PhyloCTMC (Phylogenetic Continuous Time Markov Chain). By representing all modules in collapsed form, we obtain a compact high-level visualization of the model. Arrows point from upstream to downstream components in the complete model graph. b) By expanding the modules to expose the model subgraphs they contain, we obtain a detailed description of the model. Note that the four upstream modules (Tree, Rate matrix, Site rates, and Branch rates) are all named after the corresponding pivot variable. Also note that the symbols used for pivot variables are matched across connected modules, both by name and by plate or tree plate indices. Small arrows aid the search for pivot variables. The only new variable added here, mij, is the deterministically computed rate multiplier for branch i and site j, obtained by multiplying the branch length li with the branch rate ci and the site rate rj. Details of the modules are provided in the text.
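The deterministic rate multiplier mij from this caption is a plain product, mij = li·ci·rj, and can be sketched as a nested list comprehension in Python. The function name and the numerical values are illustrative assumptions.

```python
def rate_multipliers(branch_lengths, branch_rates, site_rates):
    """Return m[i][j] = l_i * c_i * r_j for every branch i and site j."""
    return [
        [length * rate * site_rate for site_rate in site_rates]
        for length, rate in zip(branch_lengths, branch_rates)
    ]


# Hypothetical values: two branches and three site-rate categories.
m = rate_multipliers(
    branch_lengths=[0.1, 0.2],
    branch_rates=[1.0, 2.0],
    site_rates=[0.5, 1.0, 1.5],
)
print(m)
```

The outer index runs over the tree plate (branches) and the inner index over the site plate, mirroring the two plate indices attached to mij in the figure.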
Figure 9.
Module representation for a species tree-gene tree model. We simply extend the previous phylogenetic model by substituting the simple tree module by a modular representation of a species tree prior and a gene-tree distribution given the species tree. The gene tree with the entire substitution process sits on a plate representing that the model is repeated across genes. The PhyloCTMC module is shaded to reflect the fact that it is clamped to observations.
Figure 10.
Simulation of data using the graphical model of Figure 3. All simulated values are colored in blue. First, the root probability is drawn from a Beta(α = 1,β = 1) distribution, yielding 0.93, and the stationary probability is drawn from a Beta(x = 1,y = 1) distribution, yielding 0.34. Then, the characters of the root node followed by the characters of the internal nodes and tip nodes are simulated under the two-state continuous time Markov process.
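The simulation order described in this caption (parameters first, then the root, then each node after its parent) can be sketched in Python as follows. The tree topology, the branch lengths, and the scaling of the two-state process (total rate 1, so that the transition probability is π_j + (δ_ij − π_j)·exp(−l)) are assumptions made for illustration; the caption does not specify them.

```python
import math
import random

# Hypothetical topology (child -> parent): tips 1-5, internal nodes 6-8, root 9.
PARENT = {1: 6, 2: 6, 6: 7, 3: 7, 4: 8, 5: 8, 7: 9, 8: 9}
# Hypothetical fixed branch lengths l1..l8, indexed by the child node.
BRANCH_LENGTH = {node: 0.1 * node for node in PARENT}
ROOT = 9


def transition_prob(to_state, from_state, theta, length):
    """Two-state CTMC with equilibrium frequency theta for state 1 (assumed total rate 1)."""
    stationary = theta if to_state == 1 else 1.0 - theta
    same = 1.0 if to_state == from_state else 0.0
    return stationary + (same - stationary) * math.exp(-length)


def simulate(alpha=1.0, beta=1.0, x=1.0, y=1.0, seed=None):
    rng = random.Random(seed)
    p = rng.betavariate(alpha, beta)  # root probability
    theta = rng.betavariate(x, y)     # stationary probability
    states = {ROOT: 1 if rng.random() < p else 0}
    # In this labeling every parent has a larger index than its children,
    # so visiting nodes in decreasing order handles parents first.
    for node in sorted(PARENT, reverse=True):
        parent_state = states[PARENT[node]]
        p_one = transition_prob(1, parent_state, theta, BRANCH_LENGTH[node])
        states[node] = 1 if rng.random() < p_one else 0
    return p, theta, states


p, theta, states = simulate(seed=42)
print(p, theta, states)
```

Sampling in this parents-before-children order is ancestral sampling: each stochastic node is drawn only after every variable it depends on has a value.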
Figure 11.
Message passing (belief propagation) on a tree graph. a) First phase, passing messages from the tips toward the root. b) Second phase, passing messages from the root towards the tips. After the second phase, all nodes have received messages from all of their neighbors, and their marginals can be computed. If only the probability of the entire tree or the marginals of the root node are of interest, the second phase is not needed.
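A minimal Python sketch of the first phase (tips toward the root) is given below for the binary character model of Figure 3. It computes only the probability of the observed tip states, for which, as the caption notes, the second phase is not needed. The topology, the branch lengths, and the rate scaling of the two-state process are illustrative assumptions.

```python
import math
from itertools import product

# Hypothetical topology (child -> parent): tips 1-5, internal nodes 6-8, root 9.
PARENT = {1: 6, 2: 6, 6: 7, 3: 7, 4: 8, 5: 8, 7: 9, 8: 9}
BRANCH_LENGTH = {node: 0.1 * node for node in PARENT}  # hypothetical lengths
TIPS = (1, 2, 3, 4, 5)
ROOT = 9


def transition_prob(to_state, from_state, theta, length):
    """Two-state CTMC with equilibrium frequency theta for state 1 (assumed total rate 1)."""
    stationary = theta if to_state == 1 else 1.0 - theta
    same = 1.0 if to_state == from_state else 0.0
    return stationary + (same - stationary) * math.exp(-length)


def tree_likelihood(tip_states, p, theta):
    """Upward pass: message[node][s] = P(tip data below node | node is in state s)."""
    message = {tip: [1.0 if tip_states[tip] == s else 0.0 for s in (0, 1)] for tip in TIPS}
    children = {}
    for child, parent in PARENT.items():
        children.setdefault(parent, []).append(child)
    # Internal nodes are indexed so that children are processed before their parents.
    for node in sorted(children):
        msg = [1.0, 1.0]
        for s in (0, 1):
            for child in children[node]:
                # Sum out the child's state, weighting its message by the transition probability.
                msg[s] *= sum(
                    transition_prob(c, s, theta, BRANCH_LENGTH[child]) * message[child][c]
                    for c in (0, 1)
                )
        message[node] = msg
    root_prior = [1.0 - p, p]  # root state ~ Bernoulli(p)
    return sum(root_prior[s] * message[ROOT][s] for s in (0, 1))


# Hypothetical observation, using the parameter values quoted in Figure 10.
observed = {1: 1, 2: 1, 3: 0, 4: 1, 5: 0}
print(tree_likelihood(observed, p=0.93, theta=0.34))

# Sanity check: the probabilities of all 2^5 tip patterns should sum to 1.
total = sum(
    tree_likelihood(dict(zip(TIPS, pattern)), p=0.93, theta=0.34)
    for pattern in product((0, 1), repeat=len(TIPS))
)
print(total)
```

Each message summarizes everything below a node, so the cost grows linearly with the number of nodes instead of exponentially with the number of unobserved ancestral states.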
Figure 12.
A factor graph representing the binary character evolution model introduced in Figure 3. The factor graph additionally displays the probability distributions (the factors) as part of the model graph, for example, a Beta distribution, Bernoulli distribution and continuous time Markov chain (CTMC). A factor graph is always an undirected graph showing only the relationship between the variables and the corresponding distributions.
