Comparative Study
PLoS One. 2013 Jun 7;8(6):e65012.
doi: 10.1371/journal.pone.0065012. Print 2013.

Hidden Markov Models for Evolution and Comparative Genomics Analysis

Free PMC article
Nadezda A Bykova et al. PLoS One.

Abstract

The problem of reconstructing ancestral states given a phylogeny and data from extant species arises in a wide range of biological studies. A continuous-time Markov model of discrete-state evolution is generally used for this reconstruction. We modify this model to account for the case when the states of the extant species are uncertain. This situation arises, for example, when the states of extant species are predicted by some program and are thus known only with some level of reliability, which is common in bioinformatics. The main idea is to formulate the problem as a hidden Markov model on a tree (tree HMM, tHMM), in which the basic continuous-time Markov model is extended with emission probabilities of the observed data (e.g. prediction scores) for each underlying discrete state. Our tHMM decoding algorithm allows us to predict states at the ancestral nodes as well as to refine states at the leaves on the basis of quantitative comparative genomics. A test on simulated data shows that the tHMM approach applied to a continuous variable reflecting the probabilities of the states (i.e. the prediction score) is more accurate than reconstruction from a discrete state assignment defined by the best score threshold. We provide examples of applying our model to the evolutionary analysis of N-terminal signal peptides and transcription factor binding sites in bacteria. The program is freely available at http://bioinf.fbb.msu.ru/~nadya/tHMM and via a web service at http://bioinf.fbb.msu.ru/treehmmweb.
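
The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' program: it assumes two states, a continuous-time Markov chain with hypothetical rates a (0 to 1) and b (1 to 0), Gaussian score distributions with made-up parameters, and a generic upward-downward (sum-product) pass to obtain posterior state probabilities at every node. The names Node, transition, gauss, leaf_prior and posteriors are illustrative only.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    name: str                      # assumed unique within the tree
    score: Optional[float] = None  # observed score; None for internal nodes
    children: List[Tuple["Node", float]] = field(default_factory=list)  # (child, branch length)


def transition(t, a, b):
    """Two-state transition matrix P(t) for rates a (0->1) and b (1->0)."""
    r = a + b
    e = math.exp(-r * t)
    return [[(b + a * e) / r, (a - a * e) / r],
            [(b - b * e) / r, (a + b * e) / r]]


def gauss(x, mu, sd):
    """Gaussian density; the emission model is an assumption of this sketch."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def leaf_prior(score, pi, emis):
    """P(state | score) at a single leaf, ignoring the tree."""
    w = [pi[s] * gauss(score, *emis[s]) for s in (0, 1)]
    return [x / sum(w) for x in w]


def posteriors(root, a, b, emis):
    """Posterior P(state | all scores) per node; emis = [(mu0, sd0), (mu1, sd1)]."""
    pi = [b / (a + b), a / (a + b)]  # stationary distribution, used at the root
    emit, msg, up, post = {}, {}, {}, {}

    def upward(v):
        # Likelihood of the scores below v, conditional on the state at v.
        emit[v.name] = ([1.0, 1.0] if v.score is None
                        else [gauss(v.score, *emis[s]) for s in (0, 1)])
        like = list(emit[v.name])
        for c, t in v.children:
            upward(c)
            P = transition(t, a, b)
            msg[c.name] = [sum(P[s][k] * up[c.name][k] for k in (0, 1)) for s in (0, 1)]
            like = [like[s] * msg[c.name][s] for s in (0, 1)]
        up[v.name] = like

    def downward(v, down):
        # down[s]: joint weight of state s at v and all scores outside v's subtree.
        w = [down[s] * up[v.name][s] for s in (0, 1)]
        post[v.name] = [x / sum(w) for x in w]
        for c, t in v.children:
            P = transition(t, a, b)
            rest = [down[s] * emit[v.name][s]
                    * math.prod(msg[o.name][s] for o, _ in v.children if o is not c)
                    for s in (0, 1)]
            downward(c, [sum(rest[s] * P[s][k] for s in (0, 1)) for k in (0, 1)])

    upward(root)
    downward(root, pi)
    return post


# Toy tree ((A, B), C) with hypothetical scores and score distributions.
emis = [(0.3, 0.1), (0.7, 0.1)]
x = Node("X", children=[(Node("A", score=0.9), 0.1), (Node("B", score=0.8), 0.1)])
root = Node("root", children=[(x, 0.1), (Node("C", score=0.45), 0.2)])
post = posteriors(root, a=0.2, b=0.2, emis=emis)
prior_c = leaf_prior(0.45, [0.5, 0.5], emis)
```

In this toy example the score of leaf C on its own slightly favours state 0, while the posterior that also uses the scores of A and B favours state 1, which is the kind of leaf refinement the abstract refers to. The sketch does no rescaling, so it would underflow on large trees; a practical implementation would work in log space.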

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Models for the ancestral state reconstruction.
0 and 1 are two possible states at the tree nodes. The solid edges reflect transitions for the optimal states assignment; the dotted edges are non-optimal edges. The blue boxes denote optimal states at the nodes. B is the start point. (a) The discrete state model with observable states at the leaves; (b) the HMM model with observable scores and a hidden layer of states at the leaves. Note that in (b) optimal states at the leaves are chosen from the full set of states, while in (a) they are fixed to observable states.
Figure 2. The Up-Down algorithm.
Partitioning of a tree relative to state 1 at the node formula image is shown by dashed lines.
Figure 3. The comparison of the tHMM, dumbtHMM and BayesTraits on simulated data sets.
(a) The nodes, leaves and total accuracy of states reconstruction for simulations with varying transition rates and with fixed score distributions overlap value = 0.19. (b) The nodes, leaves and total accuracy of states reconstruction for fixed transition rate = 0.2 and varying score distributions overlap values. The red lines represent the tHMM results; the blue lines, BayesTraits results; the yellow lines, the results of dumbtHMM; and the green lines represent the results of the dumbtHMM reconstruction from the known assignment of states to leaves. The dashed lines show the accuracy for internal nodes reconstruction; the dotted line, the accuracy of leaves assignment (the yellow dotted line coincides with the blue dotted line); and the solid line, the mixed accuracy for all the nodes of the tree. (c) The number of reconstructed events normalized by the total tree length in the set for the same settings as in (a). (d) The number of reconstructed events normalized by the total tree length in the set for the same settings as in (b). In (c) and (d), the green line represents the real number of events; the blue line, the number of events reconstructed by the BayesTraits algorithm; the red line, by the tHMM algorithm; and the yellow line, by the dumbtHMM algorithm. (e) The Matthews correlation coefficient (MCC) for the accuracy of events reconstruction for the same settings as in (a). (f) The MCC for the accuracy of events reconstruction for the same settings as in (b). In (e) and (f), the green line represents the results of the dumbtHMM reconstruction from the known assignment of states to leaves; the blue line, the results of the BayesTraits algorithm; the red line, of the tHMM algorithm; and the yellow line, of the dumbtHMM algorithm.
Figure 4. Score distributions for states.
The blue line corresponds to the SignalP Dscore distribution of the formula image state (no signal peptide); the red one, to the distribution of the formula image state (signal peptide present).
Figure 5. The results of the dumbtHMM (left) and tHMM (right) algorithms applied to the signal peptide reconstruction on the amidase orthologous cluster (PRK07056) tree.
The black segment of a circle reflects the posterior probability of state formula image (non-signal) at a particular node. The column with circles at the right shows the prior probability of state formula image at the leaves, calculated from the score distributions. The remaining columns left to right: Dscore, prior state, posterior state from the tHMM algorithm at the leaves.
Figure 6. The results of tHMM for the TFBS reconstruction of the AsnB (L-asparaginase) tree.
Notation as in Fig. 5. The score values are Z-scores.
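
Panels (e) and (f) of Figure 3 report the Matthews correlation coefficient (MCC) for event reconstruction. As a reminder of how that measure is computed from a confusion matrix, here is a small sketch. How true and false positives are counted for events is not stated on this page; treating each branch as "event" or "no event" is an assumption made here for illustration, and the function name mcc is hypothetical.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts: branches with a correctly reconstructed event (tp),
# correctly event-free (tn), spurious events (fp) and missed events (fn).
perfect = mcc(tp=5, tn=5, fp=0, fn=0)
inverted = mcc(tp=0, tn=0, fp=5, fn=5)
chance = mcc(tp=1, tn=1, fp=1, fn=1)
```

The coefficient ranges from -1 (every call wrong) through 0 (no better than chance) to 1 (perfect agreement), which makes it more informative than plain accuracy when events are rare compared with event-free branches.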

Grant support

The work was supported by the Russian Ministry of Education and Science (State contract No 07.514.11.4007, http://eng.mon.gov.ru/) and the Russian Foundation for Basic Research (grant 11-04-02016-a to AF), by the Johns Hopkins University Framework for the Future (AF), and by the Commonwealth Foundation (AF). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. No additional external funding was received for this study.