Proc Natl Acad Sci U S A. 2017 Aug 22;114(34):9122-9127.
doi: 10.1073/pnas.1702664114. Epub 2017 Aug 7.

Origins of coevolution between residues distant in protein 3D structures

Ivan Anishchenko et al. Proc Natl Acad Sci U S A.

Abstract

Residue pairs that directly coevolve in protein families are generally close in protein 3D structures. Here we study the exceptions to this general trend (directly coevolving residue pairs that are distant in protein structures) to determine the origins of evolutionary pressure on spatially distant residues and to understand the sources of error in contact-based structure prediction. Over a set of 4,000 protein families, we find that 25% of directly coevolving residue pairs are separated by more than 5 Å in protein structures and 3% by more than 15 Å. The majority (91%) of directly coevolving residue pairs in the 5-15 Å range are found to be in contact in at least one homologous structure; these exceptions arise from structural variation in the family in the region containing the residues. Thirty-five percent of the exceptions greater than 15 Å are at homo-oligomeric interfaces, 19% arise from family structural variation, and 27% are in repeat proteins, likely reflecting alignment errors. Of the remaining long-range exceptions (<1% of the total number of coupled pairs), many can be attributed to close interactions in an oligomeric state. Overall, the results suggest that directly coevolving residue pairs not in repeat proteins are spatially proximal in at least one biologically relevant protein conformation within the family; we find little evidence for direct coupling between residues at spatially separated allosteric and functional sites or for increased direct coupling between residue pairs on putative allosteric pathways connecting them.
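The "<1% of the total" figure for unexplained long-range exceptions can be checked roughly from the percentages quoted in the abstract. The short sketch below assumes that all of the quoted percentages refer to the same set of coupled pairs, which the abstract does not state explicitly; the variable names are illustrative only.

```python
# Percentages quoted in the abstract (assumed to refer to the same pair set).
share_beyond_15 = 3  # % of all directly coupled pairs separated by > 15 Å

# Sources given for exceptions beyond 15 Å, in % of that class.
explained = {
    "homo-oligomeric interfaces": 35,
    "family structural variation": 19,
    "repeat proteins": 27,
}

# Part of the > 15 Å class not covered by the three sources above.
remaining_within_class = 100 - sum(explained.values())

# The same remainder expressed as % of all directly coupled pairs.
remaining_of_total = share_beyond_15 * remaining_within_class / 100

print(remaining_within_class, remaining_of_total)
```

Under this assumption the remainder is 19% of the > 15 Å class, or about 0.57% of all coupled pairs, which is consistent with the "<1%" stated above.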

Keywords: homo-oligomeric contacts; protein coevolution; structural variation.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.
The frequency with which coevolving directly coupled residues are in contact depends on structure quality and MSA size. (A) The contact frequency of directly coupled residue pairs depends on the number of sequences in the family. Fraction of top 0.5 × (protein length) coevolving directly coupled residue–residue pairs identified by GREMLIN that make contacts in small (protein length L ≤ 150; blue box-and-whiskers), medium (150 < L ≤ 400; red), and large (L > 400; green) protein 3D structures. The contact prediction regime analyzed in this paper (with Meff > 10^3) is highlighted in gray. Two residues in the protein 3D structure were considered to be in contact if any pair of heavy atoms is within 5 Å. (B) The contact frequency of directly coupled residue pairs increases with increasing structure accuracy. The correlation between GREMLIN prediction accuracy and X-ray crystallographic resolution is shown in scatter (Lower) and box-and-whiskers (Upper) plots (boxes and whiskers comprise 25%, 75% and 2.5%, 97.5% percentiles, respectively; the median is shown by a solid horizontal line). (C) Comparison of GREMLIN contact prediction accuracy in X-ray and solution NMR structures for 222 proteins with structures determined using both methods; contact prediction accuracy is consistently higher for the X-ray structures. The outliers marked on B and C by PDB codes are all repeat proteins.
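The contact criterion and the "top 0.5 × (protein length)" accuracy measure from this legend can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the input layout (a dict of heavy-atom coordinates per residue and a list of pairs already sorted by coupling score) and all function names are assumptions, and the cutoff is treated as inclusive.

```python
import math

CONTACT_CUTOFF = 5.0  # Å; heavy-atom cutoff quoted in the Fig. 1 legend


def min_heavy_atom_distance(res_a, res_b):
    """Smallest distance between any heavy atom of res_a and any of res_b."""
    return min(math.dist(p, q) for p in res_a for q in res_b)


def contact_fraction(ranked_pairs, residues, length, top_factor=0.5,
                     cutoff=CONTACT_CUTOFF):
    """Fraction of the top `top_factor * length` ranked pairs in contact."""
    n_top = int(top_factor * length)
    top = ranked_pairs[:n_top]
    if not top:
        return 0.0
    hits = sum(
        1 for i, j in top
        if min_heavy_atom_distance(residues[i], residues[j]) <= cutoff
    )
    return hits / len(top)


# Toy input (hypothetical): residue index -> heavy-atom (x, y, z) in Å.
toy_residues = {
    0: [(0.0, 0.0, 0.0)],
    1: [(3.0, 0.0, 0.0), (4.0, 0.0, 0.0)],
    2: [(20.0, 0.0, 0.0)],
    3: [(23.0, 0.0, 0.0)],
}
# Pairs sorted by decreasing coupling score (scores themselves omitted).
toy_ranked_pairs = [(0, 1), (0, 2), (2, 3)]

fraction = contact_fraction(toy_ranked_pairs, toy_residues, length=4)
print(fraction)
```

In the toy case the protein length is 4, so two pairs are evaluated; one of them is within 5 Å, giving a contact fraction of 0.5.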
Fig. S1.
Resolving threading ambiguities in ribosomal proteins using coevolution data. Structures of the Thermus thermophilus 50S ribosomal protein L24 from PDB entries 4V9H (B) and 4V8H (C) along with their contact maps (A); residue pairs with dmin < 5 Å in the two structures are indicated by gray squares below and above the diagonal, respectively. Strongly coevolving residue pairs identified by GREMLIN are connected by yellow lines in the structures and highlighted in blue on the contact maps. The loop marked by an arrow is two residues shorter in the first structure than in the second, which results in different sequence threadings. The sequence registry shift in the C-terminal region of 4V9H (highlighted in red) results in large distances between a number of directly coupled residue pairs (red squares on the contact map), which does not occur for 4V8H. (D) Correlating the fit of the top L (L, protein length) contacts identified by GREMLIN to the two PDB structures of the ribosome; 45 protein chains were analyzed. The x and y axes show the fractions of strongly coevolving residue pairs that are in contact in the corresponding structures. Coevolution data for five chains (L15, L18, L24, L33, and L35) fit the 4V8H structure considerably better than the 4V9H structure. (E) Exemplar contact maps for L15, L18, L33, and L35 ribosomal proteins. Contacts in the PDB structure are in light (dmin ≤ 5.0 Å) and dark (5 Å < dmin ≤ 10.0 Å) gray; top coevolving residue pairs are in blue (dmin ≤ 5.0 Å), orange (5 Å < dmin ≤ 10.0 Å), and red (dmin > 10.0 Å). Upper and lower triangles of the maps correspond to the 4V9H and 4V8H structures, respectively.
Fig. S2.
Coevolution signal in repeat proteins. (A) Spatial structure (first model of the NMR ensemble from PDB entry 2HGH) of the transcription factor IIIA fragment bound to ribosomal RNA is shown. The fragment is composed of three zinc finger domains (orange spheres are Zn2+ ions) highlighted in different colors. (B) Contact map showing actual structural contacts (gray squares) as well as 1.5 × (protein length) contacts predicted by GREMLIN: blue squares, highly coevolving pairs that are also close in structure with dmin < 5 Å (these contacts are also visualized on A by yellow sticks); orange stars, highly coevolving residue pairs with dmin > 5 Å; red squares, false coevolution signal between repeating domains. (C) 4 × 4 fragment of the contact map that is characteristic of false signal in repeat proteins.
Fig. S3.
Distribution of GREMLIN scores between residues at different distances. (A) Distributions of GREMLIN scores for strongly coevolving directly coupled residue pairs at different distances in the query protein structure: dmin ≤ 5.0 Å (blue), 5 Å < dmin ≤ 15.0 Å (green), and dmin > 15.0 Å (red). The GREMLIN score distribution for randomly chosen pairs of residues at sequence separation ≥6 and excluding 0.5 × (protein length) most strongly coupled pairs is shown in gray. (B) Distributions of direct coevolution coupling scores between residue pairs predicted by GREMLIN for 3,784 proteins (99 repeat proteins were excluded) are shown separately for different groups of residues. Dark blue, red, and green lines correspond to monomeric contacts at three distance ranges (dmin < 5 Å, 5 Å ≤ dmin < 15 Å, and dmin > 15 Å, respectively), which are also in the same distance range in all structures of homo-oligomers and homologs (both close and distant). Conversely, distributions for the improved contacts are shown in lighter colors. Dashed gray line depicts coevolution signal between randomly chosen pairs of residues.
Fig. 2.
Amino acid and distance distributions of strongly coevolving directly coupled residue pairs. (A) Distribution of distances between directly coupled residue pairs in 3,883 high-resolution X-ray protein structures (see legend to Fig. 1). Numbers indicate the fraction of the population in the ranges (0;3), (3;5), (5;10), and (10;∞). The distribution of distances for all residue pairs in the same set of protein structures is shown in Inset and as a red line for short distances in the main panel (scale is arbitrary). (B–D) Amino acid pair composition of directly coupled residue pairs at distances 0 < dmin ≤ 3 Å (B), 3 < dmin ≤ 5 Å (C), and 5 < dmin ≤ 10 Å (D) (blue, enriched; red, depleted). (E) Pearson correlation coefficients between the amino acid pair distributions in B–D and the corresponding distributions derived for all contacts in the structure in the four distance ranges.
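Panel E compares amino acid pair distributions by Pearson correlation. A minimal sketch of that comparison is given below, assuming each distribution is stored as a symmetric matrix over residue types and that only the unique unordered pairs are compared; the legend does not say how the matrices were flattened, so this layout and the helper names are assumptions. A three-letter toy alphabet stands in for the 20 amino acids.

```python
import math


def upper_triangle(matrix):
    """Flatten a symmetric matrix to its unique unordered pairs (i <= j)."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i, n)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy enrichment matrices (hypothetical values, 3 residue types).
coupled_pairs = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 5.0],
    [3.0, 5.0, 6.0],
]
# Exactly twice the first matrix, so the correlation should be 1.
all_contacts = [[2.0 * v for v in row] for row in coupled_pairs]

r = pearson(upper_triangle(coupled_pairs), upper_triangle(all_contacts))
print(r)
```

Restricting the comparison to the upper triangle avoids counting each off-diagonal pair twice, which would otherwise weight heterotypic pairs more heavily than homotypic ones.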
Fig. 3.
Origins of exceptions. The top 0.5 × (protein length) directly coupled residue pairs were analyzed for every chain from the test set of 3,883 proteins with Meff > 1,000. In Top, for each residue pair, the distance within the monomer (y axis) is plotted versus the shortest distance observed for the residue pair in (A) homo-oligomeric assemblies in the PDB biological unit, (B) homologous PDB structures detected by the HHsearch program (HMM constructed from the initial MSA was searched against the database of HMMs for the entire PDB to select matches with E-value < 1E-20), and (C) close homologs with sequence identity >95% to capture possible conformational changes. The scatter plot data are summarized in the transition diagrams below; I are pairs in physical contact, II are pairs between 5 and 15 Å, and III are pairs separated by more than 15 Å; arrows indicate the frequency with which contacts at long distance shift to shorter distances, with thicker arrows corresponding to more probable transitions. Corresponding background rates are in Fig. S7 A–C. Crystal structures exemplifying each source of exceptions are shown in D–F: (D) the homodimeric complex of the P5CR oxidoreductase (PDB entry 1YQG), (E) the α-mannosidase (hydrolase) (4AYO) along with five homologous structures (1DL2, 1NXC, 1HCU, 1X9D, 2RI9) overlaid with one another, and (F) the Sfp transferase with (white; 1QR0) and without (green; 4MRT, chain A) the PCP (cyan; 4MRT, chain C). Blue sticks in the structures indicate residue pairs that are in contact (dmin < 5 Å) in the query PDB file, and red sticks represent additional residue pairs that are adjacent at the homooligomeric interface (D), in homologous structures (E), and in the bound conformation of the Sfp protein (F). Full structures and corresponding contact maps are in Fig. S4.
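The transition diagrams described in this legend can be reproduced in outline by binning each pair into class I, II, or III twice: once by its monomer distance and once by the shortest distance seen in the alternative structures. The sketch below is illustrative only; the bin boundaries are treated as inclusive at 5 and 15 Å, and the input format (a list of distance tuples) and function names are assumptions rather than the authors' code.

```python
from collections import Counter


def distance_class(d):
    """Class I: contact (<= 5 Å); II: 5 to 15 Å; III: more than 15 Å."""
    if d <= 5.0:
        return "I"
    if d <= 15.0:
        return "II"
    return "III"


def transition_frequencies(pairs):
    """Map (start_class, end_class) to its frequency within start_class.

    `pairs` holds (monomer_distance, shortest_alternative_distance) in Å.
    The end class uses the smaller of the two distances, so a pair can
    only stay in its class or move to a shorter-distance class.
    """
    counts = Counter()
    totals = Counter()
    for d_mono, d_alt in pairs:
        start = distance_class(d_mono)
        end = distance_class(min(d_mono, d_alt))
        counts[(start, end)] += 1
        totals[start] += 1
    return {key: n / totals[key[0]] for key, n in counts.items()}


# Toy distances (hypothetical), in Å.
toy_pairs = [
    (4.0, 4.0),
    (8.0, 4.5), (8.0, 9.0),
    (20.0, 3.0), (20.0, 12.0), (20.0, 25.0), (20.0, 30.0),
]
freqs = transition_frequencies(toy_pairs)
print(freqs)
```

Running the same function on randomly chosen residue pairs instead of coupled pairs would give the kind of background rate that the legend refers to.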
Fig. S4.
Exemplar structures and contact maps for the three sources of exceptions. (A) The homodimeric complex of the P5CR oxidoreductase (one subunit is green and the other is white), (B) the α-mannosidase (green) and five homologous structures (white) overlaid with one another, and (C) the Sfp transferase with (white) and without (green) the PCP protein (cyan). Pale-blue dots above the diagonal on the contact maps indicate residue pairs that are in contact (dmin < 5 Å) in the query PDB file, and pale-red dots below the diagonal represent additional residue pairs that are adjacent at the homo-oligomeric interface (A), in homologous structures of α-mannosidase (B), and in the bound conformation of the Sfp protein (C). Corresponding GREMLIN contacts are highlighted in bright blue and bright red on the contact maps and shown by sticks in the structures. Orange stars represent directly coevolving residue pairs with dmin > 5 Å. See legend to Fig. 3 for more details.
Fig. S5.
Sources of direct coevolution couplings at large distances. (A) Contributions of homologous structures (both close and distant, green bars), interchain contacts in homooligomeric assemblies (yellow bars), and false contacts in repeat proteins (blue bars) to the explanation of exceptions in different ranges of the cutoff distance dmin are shown. (B) Cumulative contributions from the three major sources of exceptions at large distances. The total number of contacts at a given dmin (i.e., all contacts with dmin > threshold) is shown above the blue line. (C) Contributions to exceptions at intermediate (5 Å < dmin ≤ 15 Å; left bar) and long (dmin > 15 Å; right bar) distances, and (D) overall contributions to direct evolutionary couplings are shown (see legend to Fig. 4 for more details). All 0.5 × L top coevolving pairs predicted by GREMLIN for the set of 3,883 proteins (i.e., 444,296 pairs) were analyzed.
Fig. S6.
Transition probabilities for residue pairs picked at random. Arrows indicate the frequency with which a randomly selected long distance contact in a monomer shifts to shorter distances if (A) homo-oligomeric assemblies in the PDB biological unit, (B) homologous PDB structures detected by the HHsearch program, and (C) close homologs with sequence identity >95% are considered. I are pairs in physical contact, II are pairs between 5 and 15 Å, and III are pairs separated by more than 15 Å. See legend to Fig. 3.
Fig. S7.
Amino acid pair composition of coevolving residue pairs in MSAs. For every protein family, residue counts n_ab^coev (dmin < d < dmax) were calculated based on the entire MSA and subsequently normalized by the total number of sequences in the MSA. A–C correspond to distance ranges 0 < dmin ≤ 3 Å (A), 3 < dmin ≤ 5 Å (B), and 5 < dmin ≤ 10 Å (C), respectively. See Methods and legend to Fig. 2 for details.
Fig. 4.
With the exception of repeat proteins, directly coupled residue pairs in proteins are in direct physical contact. (A) Contributions from the three major sources of exceptions at intermediate (5 Å < dmin ≤ 15 Å; left bar) and long (dmin > 15 Å; right bar) distances are shown: repeat proteins (blue), homooligomeric interfaces (yellow), and homologs (both close and distant; green). Extrapolated contribution from homologs (white bars) is calculated based on the data from Fig. S4. The total number of contacts in each category is shown to the right of the corresponding bars. (B) Overall contributions to direct evolutionary couplings: Colors indicate residue pairs that are within 5 Å in the query PDB structure (blue), explained (yellow) and unexplained (red) exceptions. A subset of 235,644 directly coevolving pairs with GREMLIN scores > 0.5 were analyzed.
Fig. S8.
Effect of the number of available homologous structures on our ability to explain long-range contacts. With the increase in the number of homologs available for a given protein family in the PDB (x axis), the fraction of residue pairs that are directly evolutionarily coupled but distant in the query structure and that are corrected by homologs progressively increases. Separate analysis of monomeric distances 5 Å < dmin ≤ 15 Å (magenta) and dmin > 15 Å (green) shows that their saturation levels are different and amount to ∼95% and ∼45%, respectively. The set of 3,883 high-resolution protein structures after exclusion of 99 repeat proteins was analyzed. Before calculating the fraction of long-range contacts corrected by homologs, possible homooligomeric contacts and contacts due to conformational changes were eliminated.
Fig. 5.
Coevolutionary direct coupling in allosterically regulated proteins is between spatially adjacent residues. (A) Crystal structure (PDB entry 1RX2) of the DHFR with a cofactor NADP+ (nicotinamide adenine dinucleotide phosphate, oxidized form) and a substrate molecule (folate); 14 putative allosteric sites from ref. are highlighted in magenta. (B) Crystal structure of the cathepsin K protein (PDB entry 1ATK) bound to an allosteric inhibitor through residues Tyr169 and Arg198 (in magenta). Catalytic dyad Cys25 and His162 are shown in spheres. The strongest GREMLIN contacts are shown as yellow (top 1–5), orange (top 6–10), and red (top 11–20) sticks in the structures. No residue pairs identified by GREMLIN are distant in the structure. Corresponding contact maps are in Fig. S9.
Fig. S9.
Contact maps for the DHFR (A) and cathepsin K (B) proteins. Gray dots represent actual contacts in PDB, and blue dots show top 0.5 × (protein length) directly coupled residue pairs identified by GREMLIN. See legend to Fig. 5 for more details.

