PLoS Comput Biol. 2015 Oct 23;11(10):e1004343.
doi: 10.1371/journal.pcbi.1004343. eCollection 2015 Oct.

Automatic Prediction of Protein 3D Structures by Probabilistic Multi-template Homology Modeling

Armin Meier et al. PLoS Comput Biol.

Abstract

Homology modeling predicts the 3D structure of a query protein based on the sequence alignment with one or more template proteins of known structure. Its great importance for biological research is owed to its speed, simplicity, reliability and wide applicability, covering more than half of the residues in protein sequence space. Although multiple templates have been shown to generally increase model quality over single templates, the information from multiple templates has so far been combined using empirically motivated, heuristic approaches. We present here a rigorous statistical framework for multi-template homology modeling. First, we find that the query proteins' atomic distance restraints can be accurately described by two-component Gaussian mixtures. This insight allowed us to apply the standard laws of probability theory to combine restraints from multiple templates. Second, we derive theoretically optimal weights to correct for the redundancy among related templates. Third, a heuristic template selection strategy is proposed. We improve the average GDT-ha model quality score by 11% over single template modeling and by 6.5% over a conventional multi-template approach on a set of 1000 query proteins. Robustness with respect to wrong constraints is likewise improved. We have integrated our multi-template modeling approach with the popular MODELLER homology modeling software in our free HHpred server http://toolkit.tuebingen.mpg.de/hhpred and also offer open source software for running MODELLER with the new restraints at https://bitbucket.org/soedinglab/hh-suite.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. MODELLER’s statistical approach to homology modeling: The unknown distance d between two atoms in residues i and j of the query protein (Q) is described by a probability distribution Prob(d) that is peaked around the distance d_t between the corresponding atoms in residues i′ and j′ of the template protein (T).
This distribution Prob(d) is a probabilistic distance restraint for the distance d. To model a protein, tens to hundreds of thousands of such distance restraints between pairs of atoms in the query protein are derived. The product of all these restraint functions, which is called the likelihood function in statistics, quantifies how well a model structure satisfies all restraints at the same time. Therefore, the model structure that maximises the likelihood function represents the best solution.
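
The likelihood idea in this caption can be illustrated with a minimal sketch. It assumes, purely for illustration, that each restraint is a single Gaussian in log-distance centred on the template distance d_t; the function names and the width `sigma` are hypothetical and are not taken from MODELLER or from the paper.

```python
import math

def log_restraint(d, d_template, sigma=0.1):
    """Log of an assumed Gaussian restraint in log-distance, peaked at d_template."""
    z = (math.log(d) - math.log(d_template)) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

def log_likelihood(model_distances, template_distances, sigma=0.1):
    """Sum of log restraints, i.e. the log of the product of all restraint functions."""
    return sum(
        log_restraint(d, dt, sigma)
        for d, dt in zip(model_distances, template_distances)
    )

# Hypothetical distances in Angstrom for three atom pairs.
template = [3.8, 5.5, 10.0]
good_model = [3.9, 5.4, 10.2]   # close to the template distances
poor_model = [5.0, 8.0, 6.0]    # far from the template distances

print(log_likelihood(good_model, template))
print(log_likelihood(poor_model, template))
```

Working with the sum of logarithms rather than the raw product avoids numerical underflow when many thousands of restraints are multiplied; the model that maximises one also maximises the other.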
Fig 2. Empirical log distance distributions between pairs of atoms are well modelled by a two-component Gaussian mixture composed of a signal component and a background component.
The background component originates from pairs of residues with an alignment error. The plots show the empirical distribution of log d − log d_t = log d_ij − log d_i′j′ for thousands of sampled pairs of residues (i, i′), (j, j′) from real, error-containing pairwise sequence alignments generated with HHalign [15]. The two-component Gaussian mixture distribution predicted by the mixture density network in Fig 3B is plotted in red. From (A) to (C), the reliability of the alignments at (i, i′) and (j, j′) (as measured by pp and sim values) decreases. Consequently, the weight of the background component increases at the expense of the signal component. (D) Same as (C) but showing the distribution of N − O distances instead of Cα − Cα distances.
Fig 3. (A) Illustration of the two-component Gaussian mixture distribution in Eq (1). (B) Mixture density network to predict the parameters (w, μ, σ, μ_bg, σ_bg) of the Gaussian mixture distribution given the three variables θ = (log d_t, pp, sim) (d_t: distance in template, pp: posterior probability for both aligned residue pairs to be correctly aligned, sim: sequence similarity). Since the background component does not depend on d_t, the nodes for μ_bg and σ_bg are only connected to the two lowest hidden nodes that are not connected to log d_t.
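
Eq (1) is not reproduced on this page, so the following is only a sketch of a generic two-component Gaussian mixture over log-distance with the parameter names used in the caption (w, μ, σ, μ_bg, σ_bg). The exact parameterisation in the paper may differ, and all numerical values below are made up for illustration.

```python
import math

def gaussian(x, mu, sigma):
    """Normal density N(x; mu, sigma)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_density(x, w, mu, sigma, mu_bg, sigma_bg):
    """Assumed form: w * signal + (1 - w) * background, with x = log d."""
    return w * gaussian(x, mu, sigma) + (1.0 - w) * gaussian(x, mu_bg, sigma_bg)

# Hypothetical parameters: narrow signal near log(6 A), broad background.
mu, sigma = math.log(6.0), 0.1
mu_bg, sigma_bg = math.log(15.0), 0.6

# A less reliable alignment corresponds to a smaller signal weight w,
# which lowers the density at the signal peak.
for w in (0.9, 0.6, 0.3):
    print(w, mixture_density(mu, w, mu, sigma, mu_bg, sigma_bg))
```

In the paper the five parameters are predicted by the mixture density network from (log d_t, pp, sim); here they are simply passed in by hand.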
Fig 4. Comparison of how restraints from multiple templates are combined in Modeller (top row) and in our new approach (bottom row).
(A) In Modeller, two restraint functions (green and blue) are additively mixed with mixing weights that have to be learned on a set of triples of aligned protein structures. (B) Our new restraints are multiplied instead of being added. The background component ensures that the restraint function becomes constant and the restraint thus becomes inactive (i.e. ignored) when the distance d is far from the distance in the template. (C) Modeller’s additive mixing leads to a total restraint function that is wider than any of the single-template restraints, not narrower as it should be. (D) The multiplication of restraint functions according to probability theory leads to the desired behaviour of the total restraint function becoming more pointed with each restraint. Note that our new restraints are expressed as odds instead of densities (see also Eq 6).
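
The contrast between panels (C) and (D) can be checked numerically. The sketch below is a simplification: it uses two plain Gaussian restraints in log-distance with equal, hand-picked widths, and omits the background component and the odds formulation of the paper. The helper `grid_std` and all values are illustrative assumptions.

```python
import math

def gaussian(x, mu, sigma):
    """Normal density N(x; mu, sigma)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def grid_std(f, lo=-2.0, hi=6.0, n=8001):
    """Standard deviation of an unnormalised density f, by summation on a grid."""
    xs = [lo + (hi - lo) * k / (n - 1) for k in range(n)]
    ws = [f(x) for x in xs]
    total = sum(ws)
    mean = sum(x * w for x, w in zip(xs, ws)) / total
    var = sum((x - mean) ** 2 * w for x, w in zip(xs, ws)) / total
    return math.sqrt(var)

# Two hypothetical single-template restraints with slightly different centres.
mu1, mu2, sigma = 1.7, 1.9, 0.2

def additive(x):
    """Additive mixing with equal weights."""
    return 0.5 * gaussian(x, mu1, sigma) + 0.5 * gaussian(x, mu2, sigma)

def product(x):
    """Multiplication of the two restraints."""
    return gaussian(x, mu1, sigma) * gaussian(x, mu2, sigma)

std_add = grid_std(additive)
std_mul = grid_std(product)
print(std_add, sigma, std_mul)
```

For equal widths the product of two Gaussians has standard deviation σ/√2, so each additional consistent template sharpens the combined restraint, whereas the additive mixture is always at least as wide as its components.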
Fig 5. Iterative scheme for computing weights for templates by transforming the phylogenetic tree connecting them and the query protein into an equivalent tree with star-like topology with the query in the center.
(A) Templates t_1 and t_2 are closely related and should be down-weighted with respect to t_3. (B) Any tree T with a structure at an internal node with unknown distance d_h to which all templates are connected in a star-like topology (top) can be transformed into an equivalent tree T′ (bottom) with star-like topology, where equivalence means that the restraint on the distance d_0 of the top node is the same for both trees. τ_1, …, τ_K indicate evolutionary distances. (C) Iterative restructuring of a phylogenetic tree. In each step, the basic transformation from Fig 5B is applied to the subtree colored in blue. Weights and edge lengths get updated until all templates are directly connected to the query.
Fig 6. Selection of multiple templates.
T_acc is the set of accepted templates, L is the set of template candidates. For each template in L, its score is calculated according to Eq (14) and the template with the highest score (t_4) is added to T_acc. This process is iterated until there is no more template with a positive score, or T_acc contains more than 8 templates.
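
The greedy loop described in this caption can be sketched as follows. Eq (14) is not reproduced on this page, so the scoring function is passed in as a parameter; `select_templates`, `toy_score`, and the quality values are hypothetical stand-ins, not the paper's score.

```python
def select_templates(candidates, score, stop_above=8):
    """Greedily move the best-scoring candidate from L into T_acc.

    Stops when no remaining candidate has a positive score, or when the
    accepted set contains more than `stop_above` templates (as in the caption).
    """
    accepted = []
    remaining = list(candidates)
    while remaining and len(accepted) <= stop_above:
        best = max(remaining, key=lambda t: score(t, accepted))
        if score(best, accepted) <= 0:
            break
        accepted.append(best)
        remaining.remove(best)
    return accepted

# Toy score (an assumption): a fixed per-template quality minus a penalty
# that grows with the number of templates already accepted.
quality = {"t1": 0.9, "t2": 0.5, "t3": 0.2, "t4": 1.4}

def toy_score(template, accepted):
    return quality[template] - 0.4 * len(accepted)

chosen = select_templates(quality, toy_score)
print(chosen)
```

Because the score is re-evaluated against the current accepted set in every iteration, a scoring function can penalise candidates that add little beyond the templates already chosen.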
Fig 7. (A) Our two-component mixture restraints improve GDT-ha model quality over Modeller’s default restraints in multi-template modelling by 2.5% on average. (B) Our multi-template selection strategy improves GDT-ha scores over the simple multi-template selection strategy by 3.9% on average. (C) Multi-template modelling improves GDT-ha scores over single-template modelling (using Modeller restraints) by 4.3% on average. (D) The overall improvement through the new restraints, template weights, and the new multiple-template selection over the baseline, single-template version (s.1st.old in Table 1) is 11.1%.
Fig 8. Cumulative Z-score of all server predictions in the template-based modeling category of the CASP9 and CASP10 community-wide assessment of techniques for protein structure prediction [1, 3].
HHpred servers are red, other servers using our HHsuite software are shown in green.

References

    1. Mariani V, Kiefer F, Schmidt T, Haas J, Schwede T (2011) Assessment of template based protein structure predictions in CASP9. Proteins 79 Suppl 1: 37–58.
    2. Kinch L, Yong Shi S, Cong Q, Cheng H, Liao Y, et al. (2011) CASP9 assessment of free modeling target predictions. Proteins 79 Suppl 10: 59–73. doi: 10.1002/prot.23181
    3. Huang Y, et al. (2013) Assessment of template-based protein structure predictions in CASP10. Proteins 2: 43–56.
    4. Yan R, Xu D, Yang J, Walker S, Zhang Y (2013) A comparative assessment and analysis of 20 representative sequence alignment methods for protein structure prediction. Scientific Reports 3: 2619. doi: 10.1038/srep02619
    5. Kryshtafovych A, et al. (2014) CASP10 results compared to those of previous CASP experiments. Proteins 82: 164–174. doi: 10.1002/prot.24448

Grants and funding

This work was funded by the German Federal Ministry of Education and Research (BMBF) within the framework of e:Med (grant e:AtheroSysMed, 01ZX1313A-2014), by the Deutsche Forschungsgemeinschaft (http://www.dfg.de/en/), grant numbers GRK1721 and SFB64, and by BioSysNet (http://www.biosysnet.de/) to JS. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.