PLoS One. 2009 Sep 24;4(9):e7148.
doi: 10.1371/journal.pone.0007148.

Highly sensitive detection of individual HEAT and ARM repeats with HHpred and COACH

Fred Kippert et al. PLoS One. 2009.

Abstract

Background: HEAT and ARM repeats occur in a large number of eukaryotic proteins. As these repeats are often highly diverged, the prediction of HEAT or ARM domains can be challenging. Except for the most clear-cut cases, identification at the individual repeat level is indispensable, in particular for determining domain boundaries. However, methods using single sequence queries do not have the sensitivity required to deal with more divergent repeats and, when applied to proteins with known structures, in some cases failed to detect a single repeat.

Methodology and principal findings: Testing algorithms which use multiple sequence alignments as queries, we found two of them, HHpred and COACH, to detect HEAT and ARM repeats with greatly enhanced sensitivity. Calibration against experimentally determined structures suggests the use of three score classes with increasing confidence in the prediction, and prediction thresholds for each method. When we applied a new protocol using both HHpred and COACH to these structures, it detected 82% of HEAT repeats and 90% of ARM repeats, with the minimum for a given protein of 57% for HEAT repeats and 60% for ARM repeats. Application to bona fide HEAT and ARM proteins or domains indicated that similar numbers can be expected for the full complement of HEAT/ARM proteins. A systematic screen of the Protein Data Bank for false positive hits revealed their number to be low, in particular for ARM repeats. Double false positive hits for a given protein were rare for HEAT and not at all observed for ARM repeats. In combination with fold prediction and consistency checking (multiple sequence alignments, secondary structure prediction, and position analysis), repeat prediction with the new HHpred/COACH protocol dramatically improves prediction in the twilight zone of fold prediction methods, as well as the delineation of HEAT/ARM domain boundaries.

Significance: A protocol is presented for the identification of individual HEAT or ARM repeats which is straightforward to implement. It provides high sensitivity at a low false positive rate and will therefore greatly enhance the accuracy of predictions of HEAT and ARM domains.
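The combination rule behind the reported detection rates can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses only the lowest thresholds quoted in the Figure 3 legend (HHpred E-value <50; COACH score >10 for HEAT and >12 for ARM), counts a repeat as detected if either method passes, and assumes the overall rate is pooled over all repeats. The names `RepeatHit`, `is_detected` and `detection_summary` are hypothetical, and the example scores are invented.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Lowest thresholds as quoted in the Figure 3 legend.
HHPRED_MAX_EVALUE = 50.0
COACH_MIN_SCORE = {"HEAT": 10.0, "ARM": 12.0}


@dataclass
class RepeatHit:
    """Scores for one known repeat; None means the method returned no match."""
    hhpred_evalue: Optional[float] = None
    coach_score: Optional[float] = None


def is_detected(hit: RepeatHit, repeat_type: str) -> bool:
    """A repeat counts as detected if HHpred and/or COACH passes its threshold."""
    by_hhpred = hit.hhpred_evalue is not None and hit.hhpred_evalue < HHPRED_MAX_EVALUE
    by_coach = hit.coach_score is not None and hit.coach_score > COACH_MIN_SCORE[repeat_type]
    return by_hhpred or by_coach


def detection_summary(proteins: Dict[str, List[RepeatHit]], repeat_type: str) -> Tuple[float, float]:
    """Return (fraction pooled over all repeats, minimum per-protein fraction).

    Pooling over repeats is an assumption; the abstract does not say how
    the overall percentage was averaged.
    """
    total = found = 0
    per_protein = []
    for hits in proteins.values():
        n_found = sum(is_detected(h, repeat_type) for h in hits)
        found += n_found
        total += len(hits)
        per_protein.append(n_found / len(hits))
    return found / total, min(per_protein)


# Invented scores for two hypothetical proteins, for illustration only.
example = {
    "protA": [RepeatHit(1e-3, 25.0), RepeatHit(80.0, 11.0),
              RepeatHit(None, 4.0), RepeatHit(200.0, None)],
    "protB": [RepeatHit(10.0, None), RepeatHit(None, 13.0)],
}
overall, minimum = detection_summary(example, "HEAT")
```

The same hits can give different rates for HEAT and ARM because the COACH threshold differs between the two repeat types.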

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Archetypal HEAT and ARM repeats.
The examples of repeat pairs shown correspond well to the described archetype. Left: HEAT (PDB: 1w63A), residues 376–499; right: ARM (PDB: 1ialA), residues 366–455. Images were rendered with UCSF Chimera. The structures are rainbow-coloured from N-terminus (blue) to C-terminus (orange).
Figure 2. Sequence logos of HEAT and ARM repeats.
The logos were generated with WebLogo from the reference data sets described in Methods. Residues are shown in one-letter code, stacked in order of increasing frequency, with sizes proportional to their frequency at the position. The height of a column indicates the information content of the alignment at this position, ranging from 0 if all amino acids are present at equal frequency to 4.32 (= log2 20) if there is no variation at the position. Asterisks mark positions where the frequency of hydrophilic residues (R, K, H, E, Q, D, N) is below 4%; circles mark additional positions where the frequency is between 4% and 10%. The consensus helices as indicated have been calculated from the information at the PDB web site and show positions where at least 90% (red), 70% (orange), or 50% (yellow) of repeats are annotated as α-helical.
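The column heights and position marks described in the Figure 2 legend can be reproduced with a short calculation. The sketch below assumes the information content is log2 20 minus the Shannon entropy of the observed residue frequencies, that a column contains only the 20 standard amino acids (no gaps), and that the 4% to 10% band for circles includes both bounds. WebLogo itself may apply further corrections that are not reproduced here. The function names are hypothetical.

```python
import math
from collections import Counter

# Hydrophilic residues as listed in the Figure 2 legend.
HYDROPHILIC = set("RKHEQDN")


def column_information(column: str) -> float:
    """Information content in bits: log2(20) minus the Shannon entropy."""
    counts = Counter(column)
    n = len(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(20) - entropy


def hydrophilic_mark(column: str) -> str:
    """Return '*' below 4% hydrophilic residues, 'o' for 4% to 10%, else ''."""
    freq = sum(1 for aa in column if aa in HYDROPHILIC) / len(column)
    if freq < 0.04:
        return "*"
    if freq <= 0.10:
        return "o"
    return ""


# A fully conserved column reaches the maximum of log2(20) bits;
# a column with all 20 residues at equal frequency carries none.
conserved = column_information("L" * 30)
uniform = column_information("ACDEFGHIKLMNPQRSTVWY")
```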
Figure 3. Detection of individual HEAT and ARM repeats in proteins with known structures.
The number of full repeats was taken from the structures as deposited in PDB and/or associated publications; asterisks mark ARM proteins with truncated two-helix repeats at the N-terminus, which were not included in the analysis. Repeats detected are those with matches better than the lowest threshold by HHpred (i.e. E-value <50) and/or COACH (Established Repeats; score >10 for HEAT and >12 for ARM). HHpred and COACH results were grouped in four classes as described in the text, and the numbers of repeats falling in the three better scoring confidence classes are given here. REP and Pfam results are for subsignificant/significant matches as returned by the servers.
Figure 4. Structural variation amongst individual repeats of the same protein.
Subsequent repeats of a HEAT (top, Elongation factor 3, PDB: 2iw3A, repeats 2–8) and an ARM (bottom, Plakophilin, PDB: 1xm9A, repeats 2–8) repeat protein are shown. Repeats are arranged such that the preceding repeat is approximately in the plane of the image with its central axis arranged vertically. Images were rendered with UCSF Chimera.
Figure 5. Sequence logos of typical and diverged ARM repeats.
Comparison of “archetypal Armadillo” (top; PDB: 1g3jA, 1ee4A, 1xm9A) and more divergent ARM (bottom; PDB: 1xqrA, 1ho8A, 1upkA, 2bnxA, 2fv2A, 3dadA) repeats. The logos were generated with WebLogo from the indicated subsets of the Established Repeats data, with details as in Fig. 2.
Figure 6. Correlation between HHpred E-values and COACH scores.
HHpred and COACH results for identified repeats from the HEAT (top) and ARM (bottom) structures as specified in Fig. 3. Only repeats with HHpred E-values <500 and COACH scores >5 are included. The reference data sets for COACH analysis were: REP (left), Established Repeats (middle) and Pfam (right). Linear regressions are shown (all p<0.001); for comparison, the regressions for the REP (red) and Pfam (green) reference sets are also displayed in the middle panels.
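The filtering and regression described in the Figure 6 legend might look like the sketch below. The cut-offs (E-value <500, COACH score >5) come from the legend; regressing the COACH score on log10 of the E-value is an assumption made here, since the legend does not state which transformation was used. The function names and the example pairs are hypothetical.

```python
import math
from typing import List, Tuple


def filter_pairs(pairs: List[Tuple[float, float]],
                 max_evalue: float = 500.0,
                 min_score: float = 5.0) -> List[Tuple[float, float]]:
    """Keep (E-value, COACH score) pairs with E-value <500 and score >5."""
    return [(e, s) for e, s in pairs if e < max_evalue and s > min_score]


def least_squares(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented (E-value, COACH score) pairs, for illustration only.
pairs = [(1e-4, 30.0), (1e-2, 24.0), (1.0, 18.0),
         (100.0, 12.0), (1000.0, 9.0), (10.0, 3.0)]
kept = filter_pairs(pairs)
# Assumption: E-values are positive and compared on a log10 scale.
slope, intercept = least_squares([math.log10(e) for e, _ in kept],
                                 [s for _, s in kept])
```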
Figure 7. Protocol for repeat detection by HHpred and COACH.
Flow chart of the protocol; for further details of the individual steps, see Methods.
Figure 8. Repeat detection in dependence of reference and query alignments.
The outer bars summarise the results of COACH runs with the three reference alignments and different query alignments as described in the text (left-hand bars: query alignments as used in the calibration; right-hand bars: query alignments taken from the Established Repeats data, including N- and C-terminal repeats). Red: repeats detected with all three reference alignments (hatched area: repeats also detected by HHpred); orange: detected with two of the reference alignments; yellow: detected with one of the reference alignments; white: not detected with any of the reference alignments. Arrows mark the respective detection rates achieved with the protocol, which combines the results from the HHpred and the COACH (Established Repeats) runs. The inner bars summarise the results from all runs for each repeat type, including HHpred. Black: repeats detected in all eight runs; grey: repeats detected in one to seven runs; white: repeats not detected in any of the runs. The numbers given are how many repeats fall into each category.
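The bar categories in the Figure 8 legend amount to counting, for each repeat, how many runs detected it. A minimal sketch of that bookkeeping is given below, assuming each repeat is described by the set of reference alignments with which COACH detected it (outer bars) and by the number of runs, out of eight, that detected it (inner bars). The function names and the example data are hypothetical.

```python
from collections import Counter
from typing import Iterable

# The three COACH reference alignments named in the figure legends.
REFERENCES = ("REP", "Established Repeats", "Pfam")


def outer_category(detected_with: Iterable[str]) -> str:
    """Colour class of the outer bars, by number of reference alignments."""
    n = len(set(detected_with) & set(REFERENCES))
    return {3: "red", 2: "orange", 1: "yellow", 0: "white"}[n]


def inner_category(n_runs_detected: int, n_runs_total: int = 8) -> str:
    """Colour class of the inner bars, by number of runs that detected the repeat."""
    if n_runs_detected == n_runs_total:
        return "black"
    if n_runs_detected == 0:
        return "white"
    return "grey"


# Invented detections for four hypothetical repeats.
repeats = [
    {"REP", "Established Repeats", "Pfam"},
    {"Established Repeats", "Pfam"},
    {"Established Repeats"},
    set(),
]
outer_counts = Counter(outer_category(r) for r in repeats)
```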
Figure 9. Fold prediction and individual repeat analysis of candidate proteins.
The results of the application of fold prediction and individual repeat analysis to selected bona fide HEAT and ARM repeat protein fragments are shown. The two top hits retrieved from the fold prediction servers FFAS03, SAM-T06 and HHpred are given. HEAT (both eukaryotic and prokaryotic) and ARM templates are shown in bold font, ARM templates in blue. Highlighted in yellow are matches with “significant” scores (see Methods). HHpred and COACH (Established Repeats) results were grouped in four classes as described in the text, and the numbers of repeats falling in the three better scoring classes are shown here. Repeats detected are those with matches better than the lowest threshold by either HHpred or COACH (Established Repeats); given in brackets is the number of potential repeats, i.e. identified repeats plus additional helical segments of appropriate size. REP and Pfam results are for subsignificant/significant matches as returned by the servers.

