Dirichlet mixtures, the Dirichlet process, and the structure of protein space

J Comput Biol. 2013 Jan;20(1):1-18. doi: 10.1089/cmb.2012.0244.


The Dirichlet process is used to model probability distributions that are mixtures of an unknown number of components. Amino acid frequencies at homologous positions within related proteins have been fruitfully modeled by Dirichlet mixtures, and we use the Dirichlet process to derive such mixtures with an unbounded number of components. This application of the method requires several technical innovations to sample an unbounded number of Dirichlet-mixture components. The resulting Dirichlet mixtures model multiple-alignment data substantially better than do previously derived ones. They consist of over 500 components, in contrast to fewer than 40 previously, and provide a novel perspective on the structure of proteins. Individual protein positions should be seen not as falling into one of several categories, but rather as arrayed near probability ridges winding through amino acid multinomial space.
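Two ingredients mentioned above can be illustrated with a minimal sketch. It is not the authors' sampler, which the abstract does not describe. The function names `crp_assignments` and `log_marginal` and the parameter `concentration` are hypothetical, chosen for this example. The first function draws component labels from the Chinese restaurant process, a standard sequential view of the Dirichlet process in which the number of components is not fixed in advance. The second computes the log probability of one particular ordered sequence of observations with the given counts under a single Dirichlet component with parameters `alpha`, which is the quantity a Dirichlet mixture combines across its components.

```python
import random
from math import lgamma


def crp_assignments(n_items, concentration, seed=0):
    """Assign n_items to components by the Chinese restaurant process.

    Returns one component index per item. New components are opened
    as needed, so their number is not bounded in advance.
    """
    rng = random.Random(seed)
    counts = []  # counts[k] = number of items already in component k
    labels = []
    for _ in range(n_items):
        # An existing component k is chosen with weight counts[k];
        # a new component is opened with weight `concentration`.
        weights = counts + [concentration]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        labels.append(k)
    return labels


def log_marginal(counts, alpha):
    """Log probability of one ordered sequence with these counts under
    a Dirichlet(alpha) prior on the multinomial parameters
    (no multinomial coefficient)."""
    total = lgamma(sum(alpha)) - lgamma(sum(counts) + sum(alpha))
    for c, a in zip(counts, alpha):
        total += lgamma(c + a) - lgamma(a)
    return total
```

For amino acid data, `counts` and `alpha` would each have 20 entries, one per residue type. A full sampler would combine the process weights with `log_marginal` when reassigning each alignment column, a step omitted here.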

Publication types

  • Research Support, N.I.H., Intramural
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Algorithms
  • Bayes Theorem
  • Computational Biology
  • Likelihood Functions
  • Markov Chains
  • Mathematical Concepts
  • Models, Statistical
  • Monte Carlo Method
  • Probability Theory
  • Proteins / chemistry*
  • Proteins / genetics*
  • Sequence Alignment / statistics & numerical data*
  • Statistics, Nonparametric


Substances

  • Proteins