A Bayesian sampler for optimization of protein domain hierarchies

Andrew F Neuwald

doi:10.1089/cmb.2013.0099

J Comput Biol. 2014 Mar;21(3):269-86. doi: 10.1089/cmb.2013.0099. Epub 2014 Feb 4.

Author

Andrew F Neuwald¹

Affiliation

¹ Institute for Genome Sciences and Department of Biochemistry & Molecular Biology, University of Maryland School of Medicine, Baltimore, Maryland.

Abstract

The process of identifying and modeling functionally divergent subgroups for a specific protein domain class and arranging these subgroups hierarchically has, thus far, largely been done via manual curation. How to accomplish this automatically and optimally is an unsolved statistical and algorithmic problem that is addressed here via Markov chain Monte Carlo sampling. Taking as input a (typically very large) multiple-sequence alignment, the sampler creates and optimizes a hierarchy by adding and deleting leaf nodes, by moving nodes and subtrees up and down the hierarchy, by inserting or deleting internal nodes, and by redefining the sequences and conserved patterns associated with each node. All such operations are based on a probability distribution that models the conserved and divergent patterns defining each subgroup. When we view these patterns as sequence determinants of protein function, each node or subtree in such a hierarchy corresponds to a subgroup of sequences with similar biological properties. The sampler can be applied either de novo or to an existing hierarchy. When applied to 60 protein domains from multiple starting points in this way, it converged on similar solutions with nearly identical log-likelihood ratio scores, suggesting that it typically finds the optimal peak in the posterior probability distribution. Similarities and differences between independently generated, nearly optimal hierarchies for a given domain help distinguish robust from statistically uncertain features. Thus, a future application of the sampler is to provide confidence measures for various features of a domain hierarchy.
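
To make the accept/reject idea behind Markov chain Monte Carlo sampling concrete, the following is a minimal sketch and not the paper's sampler. It covers only one of the operations mentioned above (reassigning sequences to nodes) and keeps the tree fixed at two nodes. The data in `SEQS`, the constant `NODES`, and the functions `log_score` and `metropolis` are all hypothetical, and the toy score merely stands in for the probability model the abstract refers to.

```python
import math
import random

# Hypothetical toy data: each "sequence" is reduced to the residue it carries
# at a single aligned column. This is NOT the paper's input format or model.
SEQS = ["K", "K", "K", "R", "R", "R", "K", "R"]
NODES = 2  # assumed fixed number of subgroups (the real sampler also changes the tree)


def log_score(assign):
    """Toy score: for each node, sum count(res) * log(fraction of res in that node).

    The score is 0 when every node holds a single residue type and negative otherwise.
    """
    total = 0.0
    for node in range(NODES):
        members = [SEQS[i] for i, a in enumerate(assign) if a == node]
        for res in set(members):
            count = members.count(res)
            total += count * math.log(count / len(members))
    return total


def metropolis(steps, seed=0):
    """Metropolis sampling over assignments, targeting exp(log_score)."""
    rng = random.Random(seed)
    assign = [rng.randrange(NODES) for _ in SEQS]
    current = log_score(assign)
    best, best_score = list(assign), current
    for _ in range(steps):
        # Symmetric proposal: move one randomly chosen sequence to a random node.
        i = rng.randrange(len(SEQS))
        old = assign[i]
        assign[i] = rng.randrange(NODES)
        proposed = log_score(assign)
        # Accept improvements always, and worse states with probability exp(difference).
        if proposed >= current or rng.random() < math.exp(proposed - current):
            current = proposed
        else:
            assign[i] = old  # reject: restore the previous assignment
        if current > best_score:
            best, best_score = list(assign), current
    return best, best_score


if __name__ == "__main__":
    best, score = metropolis(2000)
    print(best, score)
```

Because the proposal is symmetric, the plain Metropolis acceptance rule suffices here. Moves that add, delete, or relocate nodes, as described in the abstract, would generally need a proposal-ratio correction that this sketch does not attempt.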

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Amino Acid Sequence / genetics*
Base Sequence
Bayes Theorem
Computer Simulation
Markov Chains
Monte Carlo Method
Protein Structure, Tertiary / genetics*
Sequence Analysis, Protein / methods*

Grants and funding

HHSN2630000999571/PHS HHS/United States