PLoS Comput Biol. 2016 Jun 2;12(6):e1004817.
doi: 10.1371/journal.pcbi.1004817. eCollection 2016 Jun.

Evolution-Based Functional Decomposition of Proteins

Olivier Rivoire et al. PLoS Comput Biol. 2016.

Abstract

The essential biological properties of proteins (folding, biochemical activities, and the capacity to adapt) arise from the global pattern of interactions between amino acid residues. The statistical coupling analysis (SCA) is an approach to defining this pattern that involves the study of amino acid coevolution in an ensemble of sequences comprising a protein family. This approach indicates a functional architecture within proteins in which the basic units are coupled networks of amino acids termed sectors. This evolution-based decomposition has potential for new understandings of the structural basis for protein function. To facilitate its usage, we present here the principles and practice of the SCA and introduce new methods for sector analysis in a Python-based software package (pySCA). We show that the pattern of amino acid interactions within sectors is linked to the divergence of functional lineages in a multiple sequence alignment, which provides a model for how sector properties might be differentially tuned in members of a protein family. This work provides new tools for studying proteins and for generally testing the concept of sectors as the principal units of function and adaptive variation.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Three representations of a multiple sequence alignment composed of M sequences and L positions.
A, ASCII text. B, a three-dimensional binary array $x_{si}^a$, in which $x_{si}^a = 1$ if sequence s has amino acid a at position i, and 0 otherwise; gaps are always set to 0. In this representation, the frequencies of amino acids at individual positions are $f_i^a = \langle x_{si}^a \rangle_s = \sum_s w_s x_{si}^a / M'$, where $w_s$ is the weight for each sequence s and $M' = \sum_s w_s$ represents the effective number of sequences in the alignment. Joint frequencies of amino acids between pairs of positions are defined by $f_{ij}^{ab} = \langle x_{si}^a x_{sj}^b \rangle_s = \sum_s w_s x_{si}^a x_{sj}^b / M'$. C, a two-dimensional alignment matrix $X_{sn}$, in which the index s (along rows) represents sequences and the index n (along columns) combines the amino acid and position dimensions into one, such that n = 20(i − 1) + a. This representation is useful in explaining the relationship between patterns of coevolution between amino acids and patterns of sequence divergence in the protein family (see Eq (12)).
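
The three representations in Fig 1 can be illustrated with a minimal NumPy sketch. This is not the pySCA implementation: the function names, the amino acid ordering, the toy alignment, and the uniform weights are all assumptions made for illustration. Indices are 0-based in the code, so the flattened column index is n = 20·i + a rather than 20(i − 1) + a.

```python
import numpy as np

# Ordering of the 20 amino acids is an assumption of this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot_alignment(seqs):
    """Binary array x[s, i, a]; gaps (and any other symbol) stay all-zero."""
    index = {aa: a for a, aa in enumerate(AMINO_ACIDS)}
    x = np.zeros((len(seqs), len(seqs[0]), len(AMINO_ACIDS)))
    for s, seq in enumerate(seqs):
        for i, aa in enumerate(seq):
            if aa in index:
                x[s, i, index[aa]] = 1.0
    return x

def weighted_frequencies(x, w):
    """Single-site f[i, a] and pairwise f[i, j, a, b], normalised by M' = sum(w)."""
    w = np.asarray(w, dtype=float)
    m_eff = w.sum()  # effective number of sequences M'
    fia = np.einsum("s,sia->ia", w, x) / m_eff
    fijab = np.einsum("s,sia,sjb->ijab", w, x, x) / m_eff
    return fia, fijab

def flatten_alignment(x):
    """Reshape x[s, i, a] into X[s, n] with n = 20*i + a (0-based indices)."""
    return x.reshape(x.shape[0], -1)

# Toy alignment (hypothetical): M = 3 sequences, L = 3 positions, '-' is a gap.
seqs = ["AC-", "AD-", "GCA"]
x = one_hot_alignment(seqs)
fia, fijab = weighted_frequencies(x, w=[1.0, 1.0, 1.0])
X = flatten_alignment(x)
```

Because gaps are encoded as all-zero rows, the frequencies at a gapped position sum to less than one; in the toy example only one of three sequences has a residue at the last position.
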
Fig 2. $D_i^a$, the measure of amino acid conservation.
A, A plot of $D_i^a$ as a function of $f_i^a$, the amino acid frequency, and $q^a$, the background frequency, set here to 0.05 for illustration. See the Supplementary Information for actual values of q.
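
The caption does not restate the formula, so the sketch below assumes the standard SCA definition of conservation as the Kullback-Leibler relative entropy between the observed frequency f and the background frequency q, treated as a two-state (present/absent) distribution. The function name and the default q = 0.05 (taken from the illustration in Fig 2) are choices made for this example.

```python
import math

def conservation(f, q=0.05):
    """Relative entropy D = f ln(f/q) + (1 - f) ln((1 - f)/(1 - q)).

    Assumes 0 <= f <= 1 and 0 < q < 1; terms with zero probability
    contribute 0 by the usual convention 0 * ln(0) = 0.
    """
    def term(p, r):
        return 0.0 if p == 0.0 else p * math.log(p / r)
    return term(f, q) + term(1.0 - f, 1.0 - q)

# D vanishes when the observed frequency equals the background frequency
# and grows steeply as f approaches 1.
values = [conservation(f) for f in (0.05, 0.5, 1.0)]
```

Under this definition, a fully conserved amino acid (f = 1) with q = 0.05 gives D = ln(1/0.05) = ln 20.
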
Fig 3. Positional conservation ($D_i$) and the SCA weighted correlation matrix $\tilde{C}_{ij}$ for the G protein family.
A-B, The overall positional conservation $D_i$ for the G protein alignment, and a corresponding mapping on a slice through the core of the atomic structure of a representative member of the family (human Ras, PDB 5P21). The data show that the top 50% of conserved positions (in red) lie at functional surfaces and within the solvent-inaccessible core. Thus, positional conservation maps to an intuitive and well-known decomposition of protein structures. C-D, $\tilde{C}_{ij}$ ordered by primary structure (C), and after hierarchical clustering (D). The data describe a sparse and seemingly hierarchical organization of correlations, a general result for most protein families.
Fig 4. Spectral decomposition and ICA.
A-B, The eigenspectrum of $\tilde{C}_{ij}$ (in black bars) for the G protein (A) and S1A (B) protein families. The eigenvalue distribution expected randomly is shown in red and provides a statistical basis for defining the k* top eigenmodes for further analysis: conservatively, those greater than the second random eigenvalue. The first random eigenvalue is ignored since it is a trivial consequence of retaining the independent conservation of sites in the randomization process [10]. This analysis suggests k* = 4 and k* = 7 for the G and S1A families, respectively. C-D, The top three eigenvectors for the G (C) and S1A (D) families suggest the possibility of distinct groups of coevolving positions, but illustrate the property that these groups emerge along combinations of eigenmodes. E-F, Independent components analysis (ICA) optimizes the independence of groups emerging along the different directions, putting the top three groups of amino acids on nearly orthogonal axes. The group of positions contributing to each IC is defined by fitting an empirical statistical distribution to the ICs and choosing positions above a defined cutoff (default, > 95% of the CDF). Groups of positions in panels C-F are defined and colored accordingly.
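
The rule for choosing k* described in panels A-B can be sketched as follows. The caption states only that significant eigenvalues are those greater than the second random eigenvalue; how that value is aggregated over randomization trials is not given here, so averaging it across trials is an assumption of this sketch, as are the function name and the toy numbers.

```python
def count_significant_modes(eigenvalues, random_eigenvalues):
    """Count eigenvalues above the second-largest random eigenvalue.

    eigenvalues: eigenvalues of the SCA matrix for the real alignment.
    random_eigenvalues: one list of eigenvalues per randomized alignment.
    The threshold is the mean, over trials, of the second-largest random
    eigenvalue (the averaging is an assumption of this sketch). The largest
    random eigenvalue is ignored, as described in the caption.
    """
    second_largest = [sorted(trial, reverse=True)[1] for trial in random_eigenvalues]
    threshold = sum(second_largest) / len(second_largest)
    return sum(1 for value in eigenvalues if value > threshold)

# Toy numbers (hypothetical): two randomization trials, threshold = 2.25.
k_star = count_significant_modes(
    eigenvalues=[10.0, 6.0, 3.0, 1.0],
    random_eigenvalues=[[8.0, 2.0, 1.0], [9.0, 2.5, 0.5]],
)
```

A stricter threshold, for example adding a multiple of the standard deviation across trials, would fit the same interface and yield a more conservative k*.
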
Fig 5. The mathematical relationship between sequence and positional correlations.
A, A binary matrix representation of the alignment $X_{sn}$, composed of M sequences by 20 × L amino acids (Fig 1C); the equation shows the singular value decomposition (SVD) of X (Eq (12)). From the alignment matrix, two correlation matrices can be computed: S, a correlation matrix over rows (B) describing relationships between sequences, and F, a correlation matrix over columns (C) describing relationships between amino acids; equations show the eigenvalue decompositions of these matrices. By the SVD, X provides a mapping between the two such that the eigenvectors of F (in V) correspond to the eigenvectors of S (in U). Thus, it is possible to associate coevolving groups of amino acids to patterns of sequence divergence in the alignment. As described in the text, a similar mapping is possible for positional (rather than amino acid specific) coevolution (Eq (14)).
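
The correspondence between U and V described in Fig 5 can be checked numerically. The sketch below uses a small random binary matrix in place of a real alignment and takes S and F to be the unnormalized, uncentered products X Xᵀ and Xᵀ X; any normalization or weighting used in the paper is omitted, and all names are chosen for this example.

```python
import numpy as np

def svd_mapping(X):
    """Return U, singular values, V, and the matrices S = X X^T, F = X^T X."""
    U, sing, Vt = np.linalg.svd(X, full_matrices=False)
    S = X @ X.T   # relationships between rows (sequences)
    F = X.T @ X   # relationships between columns (amino acids at positions)
    return U, sing, Vt.T, S, F

# Toy stand-in for an alignment matrix: 6 "sequences" by 8 binary columns.
rng = np.random.default_rng(0)
X = (rng.random((6, 8)) < 0.4).astype(float)
U, sing, V, S, F = svd_mapping(X)

# Columns of U are eigenvectors of S, columns of V are eigenvectors of F,
# and both share the eigenvalues sing**2.
s_ok = np.allclose(S @ U, U * sing**2)
f_ok = np.allclose(F @ V, V * sing**2)

# X maps each eigenvector of F onto the matching eigenvector of S:
# X V = U diag(sing), so U = X V / sing wherever sing is non-zero.
keep = sing > 1e-10
map_ok = np.allclose((X @ V)[:, keep] / sing[keep], U[:, keep])
```

The last check is the mapping the caption refers to: projecting the alignment onto a column-space mode yields the sequence-space pattern associated with that mode.
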
Fig 6. IC-based sequence divergences in the S1A protein family.
The panels show scatterplots of sequences in the S1A alignment along dimensions ($\tilde{U}^p_{1-6}$) that correspond to sequence variation in positions contributing to each of the top six ICs of the SCA coevolution matrix. The mapping from positional coevolution to sequence relationships is achieved using the reduced alignment matrix x, as per Eqs (14) and (15). Sequences are colored either by enzymatic activity (A-C, the haptoglobins are non-catalytic members of the S1A family), annotated catalytic specificity (D-F), or taxonomic origin (G-I). For each graph, the stacked histograms show the distributions of these classifications for each dimension. Note that trypsin, tryptase, kallikreins, and certain granzymes have tryptic specificity, and chymotrypsin and most granzymes have chymotryptic specificity. The data show that IC1 specifically separates sequences by enzymatic activity (A), IC2 separates sequences by catalytic specificity (D), IC3 separates sequences by invertebrate/vertebrate origin (H), and ICs 4–6 show more minor variations by catalytic specificity (E-F). These data (1) recapitulate and extend previous observations [10], and (2) demonstrate the functional relevance of the IC-based decomposition.
Fig 7. IC-based sequence divergences in the G protein family.
The panels show scatterplots of sequences in the G protein alignment along dimensions ($\tilde{U}^p_{1-4}$) that correspond to sequence variation in positions contributing to each of the four ICs of the SCA coevolution matrix. The mapping from positional coevolution to sequence relationships is achieved using the reduced alignment matrix x, as per Eqs (14) and (15). Sequences are colored either by annotated functional sub-type of G protein (A-B) or by taxonomic origin (C-D), and the stacked histograms show the distributions of these classifications for each dimension. The data show that ICs 1 and 2 (A) correspond to distinct sequence divergences of functional subtypes of G protein; for example, IC1 separates the Rho proteins (green) along $\tilde{U}^p_1$, and IC2 separates the Rho proteins (green) and a subset of Ras proteins (red) along $\tilde{U}^p_2$. In contrast, IC3 and IC4 are homogeneous with regard to G protein subtype (B), and all ICs are essentially homogeneous with regard to phylogenetic divergence (C-D). These data suggest that IC3 and IC4 are nearly homogeneous features of the G protein family, while IC1 and IC2 are differentially selected for more specialized properties of G protein subtypes.
Fig 8. IC-based decomposition and positional conservation.
Panels A-B show stacked histograms of positional conservation ($D_i$) for the S1A and G protein families, respectively, with positions corresponding to different ICs marked in color as indicated. The data show that, consistent with conservation-based weighting, positions contributing to the top ICs tend to be more conserved than average, but that the ICs cannot be distinguished from one another by the magnitude of positional conservation alone. Thus the IC-based decomposition of sequences is uniquely a property of analyzing correlations.
Fig 9. Sector identification for the G protein family.
A shows the IC-based sub-matrix of the $\tilde{C}_{ij}$ matrix for the G protein family, and B-C show the structural interpretations on a representative member of the family (H-Ras, PDB 5P21 [48]). IC4 represents a nearly independent group of coevolving positions (sector 2, red), while ICs 1, 2, and 3 show strong inter-IC correlations that suggest classification as a single hierarchically-organized sector (sector 1, different shades of blue). B, Structurally, sector 1 comprises the nucleotide binding pocket (IC1) and the connection to the so-called switch domains 1 and 2, which interact with downstream target proteins (ICs 2 and 3). Together, these regions correspond to the known allosteric mechanism in the G protein family. Sector 2 corresponds to a distinct, largely contiguous group of amino acids with an as yet unclear functional role. C, The three ICs comprising sector 1 mapped on the atomic structures of the active GTPγS-bound state (PDB 5P21 [48], left panels) and inactive GDP-bound state (PDB 4Q21 [49], right panels) of H-Ras. The data show that ICs 1 and 2 undergo substantial state-dependent conformational change. These same ICs also show distinct patterns of variation along different G protein sub-types (Fig 7A), suggesting that variations in these ICs tune allosteric or substrate binding properties.

References

    1. Anfinsen CB. Principles that govern the folding of protein chains. Science. 1973. July;181(4096):223–30. 10.1126/science.181.4096.223 - DOI - PubMed
    1. Lichtarge O, Bourne HR, Cohen FE. An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol. 1996. March;257(2):342–58. 10.1006/jmbi.1996.0167 - DOI - PubMed
    1. Lockless SW, Ranganathan R. Evolutionarily conserved pathways of energetic connectivity in protein families. Science. 1999. October;286(5438):295–9. 10.1126/science.286.5438.295 - DOI - PubMed
    1. Morcos F, Pagnani A, Lunt B, Bertolino A, Marks DS, Sander C, et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc Natl Acad Sci U S A. 2011. December;108(49):E1293–301. 10.1073/pnas.1111471108 - DOI - PMC - PubMed
    1. Skerker JM, Perchuk BS, Siryaporn A, Lubin EA, Ashenberg O, Goulian M, et al. Rewiring the specificity of two-component signal transduction systems. Cell. 2008. June;133(6):1043–54. 10.1016/j.cell.2008.04.040 - DOI - PMC - PubMed
