PLoS Comput Biol. 2016 Dec 21;12(12):e1005294.
doi: 10.1371/journal.pcbi.1005294. eCollection 2016 Dec.

Inference of Functionally-Relevant N-acetyltransferase Residues Based on Statistical Correlations

Andrew F Neuwald et al. PLoS Comput Biol. 2016.

Abstract

Over evolutionary time, members of a superfamily of homologous proteins sharing a common structural core diverge into subgroups filling various functional niches. At the sequence level, such divergence appears as correlations that arise from residue patterns distinct to each subgroup. Such a superfamily may be viewed as a population of sequences corresponding to a complex, high-dimensional probability distribution. Here we model this distribution as hierarchical interrelated hidden Markov models (hiHMMs), which describe these sequence correlations implicitly. By characterizing such correlations one may hope to obtain information regarding functionally-relevant properties that have thus far evaded detection. To do so, we infer a hiHMM distribution from sequence data using Bayes' theorem and Markov chain Monte Carlo (MCMC) sampling, which is widely recognized as the most effective approach for characterizing a complex, high-dimensional distribution. Other routines then map correlated residue patterns to available structures with a view to hypothesis generation. When applied to N-acetyltransferases, this approach reveals sequence and structural features indicative of functionally important, yet generally unknown, biochemical properties. Even for sets of proteins for which nothing is known beyond unannotated sequences and structures, the approach can lead to helpful insights. We describe, for example, a putative coenzyme-A-induced-fit substrate binding mechanism mediated by arginine residue switching between salt bridge and π-π stacking interactions. A suite of programs implementing this approach is available (psed.igs.umaryland.edu).
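To make the combination of Bayes' theorem and MCMC sampling concrete, the following is a minimal toy sketch, not the hiHMM sampler described in the abstract. It assumes a far simpler problem: estimating the frequency of one residue at one alignment column under a uniform prior, using a random-walk Metropolis sampler. The function names, the proposal step size, and the example counts are all illustrative assumptions.

```python
import math
import random


def log_posterior(p, matches, total):
    """Unnormalized log posterior of a residue frequency p.

    Assumes a uniform prior on (0, 1) and a binomial likelihood
    (hypothetical toy model, not the hiHMM of the paper).
    """
    if not 0.0 < p < 1.0:
        return float("-inf")
    return matches * math.log(p) + (total - matches) * math.log(1.0 - p)


def metropolis(matches, total, steps=20000, step_size=0.1, seed=0):
    """Random-walk Metropolis sampler over the frequency p."""
    rng = random.Random(seed)
    p = 0.5
    current = log_posterior(p, matches, total)
    samples = []
    for _ in range(steps):
        proposal = p + rng.gauss(0.0, step_size)
        candidate = log_posterior(proposal, matches, total)
        # Accept with probability min(1, posterior ratio); proposals
        # outside (0, 1) have ratio 0 and are always rejected.
        if rng.random() < math.exp(min(0.0, candidate - current)):
            p, current = proposal, candidate
        samples.append(p)
    return samples


# Example: 18 of 22 sequences carry the residue at this column.
samples = metropolis(matches=18, total=22)
kept = samples[2000:]  # discard burn-in
posterior_mean = sum(kept) / len(kept)
```

For this toy model the exact posterior is a Beta distribution, so the sampled mean can be checked against the closed form (matches + 1) / (total + 2). The models in the paper have no such closed form, which is the situation in which sampling is useful.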

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Residues, mostly of unknown function, that are highly conserved in putative orthologs of human Naa10 acetylase from metazoans (phyla indicated in red), fungi (brown), protozoans (cyan), and plants (green).
Fig 2. hiMSA creation and analysis.
A. Flow chart showing the steps required to create and interpret a hiMSA (as described in text). B. Schematic of a BPPS-generated “contrast alignment” that corresponds to node 8 of the hierarchy in (A). One such contrast alignment is created for each node in the hierarchy. Sequences assigned to node 8’s subtree (blue nodes in (A)) constitute a ‘foreground’ partition, those assigned to the most closely related nodes (red nodes in (A)) constitute a ‘background’ partition, and the remaining sequences constitute a non-participating partition. Horizontal bars represent sequences assigned to the similarly-colored corresponding nodes in (A). Blue vertical bars represent conserved foreground residue patterns (as shown below each bar); these diverge from (or contrast with) the background compositions at those positions (white vertical bars). Red vertical bars above the alignment quantify the degree of divergence. C. BPPS sampling explores the space of domain hierarchies by attaching or removing leaf nodes, moving subtrees, inserting or deleting internal nodes, moving sequences between nodes and, for each subtree, adding or deleting residue patterns based on how well they discriminate the foreground from the background (as shown in (B)). D. Schematic diagram of a hiMSA from the perspective of a leaf node. One such diagram could be created for each node in a hierarchy. (center) The node 6 lineage of the full hiMSA. Horizontal lines represent aligned sequences and are color-coded by level in the hierarchy. Thin light gray horizontal lines represent non-homologous and deleted regions. Vertical lines represent the contrasting pattern positions upon which the hierarchy is based and are similarly color-coded by levels. (left & right sides) Subtrees corresponding to each level. The colored, gray and white nodes in each tree correspond, respectively, to their alignment foreground, background and non-participating partitions, the sequences of which are colored similarly. The background for the entire superfamily (lower right) consists of random sequences.
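The foreground-versus-background contrast in panel B can be illustrated with a short sketch. It is not the BPPS scoring function: it assumes a simpler, commonly used measure, the relative entropy (Kullback-Leibler divergence) between the residue frequencies of a foreground column and those of the matching background column. The pseudocount, the function names, and the example sequences are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(column, pseudocount=1.0):
    """Residue frequencies for one alignment column.

    A pseudocount (an assumption of this sketch) keeps every
    frequency above zero; gaps and other symbols are ignored.
    """
    counts = Counter(res for res in column if res in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def column_divergence(foreground, background):
    """Relative entropy (bits) of foreground vs. background frequencies."""
    fg = column_frequencies(foreground)
    bg = column_frequencies(background)
    return sum(fg[aa] * math.log2(fg[aa] / bg[aa]) for aa in AMINO_ACIDS)


def contrast_scores(fg_seqs, bg_seqs):
    """Per-column divergence for two sets of equal-length aligned sequences."""
    return [
        column_divergence(fg_col, bg_col)
        for fg_col, bg_col in zip(zip(*fg_seqs), zip(*bg_seqs))
    ]


# Hypothetical three-column alignments.
foreground = ["GRA", "GRA", "GKA"]
background = ["GDA", "GEL", "GSV"]
scores = contrast_scores(foreground, background)
```

In this example the first column is identical in both partitions and scores zero, while the second column (basic residues in the foreground, other residues in the background) scores highest, which is the kind of position a tall red bar would mark.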
Fig 3. Hierarchy and key features of the acetylase superfamily.
A. The acetylase hierarchy identified by the sampler. For clarity smaller subtrees not discussed in the text have been omitted; the complete hierarchy is given in S1.3 Fig. Purple nodes are not discussed in the text. B. Root node “contrast alignment” highlighting conserved patterns most characteristic of acetylases as a whole. Shown are six representative sequences assigned to node-13 of the acetylase hierarchy in (A). These sequences correspond to an uncharacterized prokaryotic acetylase family that conserves all of the root node canonical residues. The sequences are labeled by their bacterial phyla except for the first (proteobacterial) sequence, the structure of which is shown in (C). Below the representative alignment is a summary of the most conserved amino acid residues at each position; the number of sequences (assigned to the foreground) is given in parentheses on the first line. The 1st to 3rd lines show up to three residues at each position that occur both most frequently and in ≥10% of the sequences. Directly below this, the frequencies of the designated residues are given in integer tenths; for example, an ‘8’ indicates that 80–90% of the sequences in the foreground alignment match the corresponding pattern residue. In column 88, for example, glycine occurs in 60–70% and alanine in 20–30% of the sequences. To highlight larger integers, ‘5’ and ‘6’ are shown in black and ‘7’-‘9’ in red. The first of these lines (labeled as “wt_res_freqs” for “weighted residue frequencies”) reports the effective number of aligned sequences. In all of these cases, reported frequencies have been down-weighted for redundancy. The black dots above the alignment indicate the pattern positions that were identified by the sampler. Pattern-matching (correlated) residues are highlighted in color, with biochemically similar residues colored similarly. For example, acidic residues are shown in red, basic residues in cyan and hydrophobic residues in yellow; histidine, glycine and proline are each assigned a unique color. The heights of the red bars above the alignment quantify (using a semi-logarithmic scale) the degree to which residue frequencies in the foreground diverge at each position from the corresponding positions in the background. In this case, the foreground corresponds to the root node, that is, to the entire tree and thus to all acetylases, and the background corresponds to all proteins unrelated to acetylases, which is represented by standard amino acid residue frequencies. C. The acetylase fold with canonical residues most characteristic of the superfamily. The structure shown is that of an E. coli putative N-acetyltransferase assigned to node 13 (pdb_id: 2kcw); it corresponds to the first sequence aligned in (B). D-F. Residue positions likely responsible for acetylase functional specificity. D. Histogram of normalized average ∆-BILD scores over all column positions. Scores were linearly adjusted so that the lowest score is zero and the highest score is 100. Data points with scores greater than 50 are plotted above the histogram and are spread out vertically to avoid overlap. Histogram bars that are more than two standard deviations above the mean are colored red; corresponding data points are color coded (as explained in text) and enlarged to enhance visibility. Numbers next to data points correspond to the positions of the corresponding aligned columns within the main alignment (i.e., the root node alignment) shown in S4 Fig. E. Surface representation of the substrate binding pocket showing the locations of six of the residues in (F), which are color coded and numbered as in (D). See text for further details. F. Locations within the crystal structure of Pseudomonas syringae tabtoxin resistance protein complexed with acyl-CoA (pdb_id: 1gheB) [82] of the nine residues corresponding to the rightmost data points in (D); this protein was assigned to node 104. Residue sidechains are colored as are the data points in (D) and labeled by column positions in the core alignment. In addition, four consensus amino acid residues generally conserved in acetylases are shown in yellow; acyl-CoA is shown in cyan.
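Two of the numeric conventions in this legend can be sketched in a few lines: the integer-tenths frequency code used in (B) and the 0 to 100 linear rescaling with a two-standard-deviation cutoff used in (D). This is an illustrative reading of the legend, not the authors' code. The function names are hypothetical, capping a frequency of exactly 100% at ‘9’ is an assumption, and applying the cutoff directly to the rescaled scores (with the population standard deviation) is also an assumption.

```python
import statistics


def tenths_code(frequency):
    """Map a frequency in [0, 1] to an integer tenth, e.g. 0.83 -> 8.

    Assumption: a frequency of exactly 1.0 is capped at 9, since the
    legend mentions single digits only.
    """
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must lie in [0, 1]")
    return min(int(frequency * 10), 9)


def normalize_scores(scores):
    """Linearly rescale so the lowest score is 0 and the highest is 100."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [100.0 * (s - lo) / (hi - lo) for s in scores]


def flag_outliers(scores, n_sd=2.0):
    """Indices of scores more than n_sd standard deviations above the mean."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    return [i for i, s in enumerate(scores) if s > mean + n_sd * sd]


# Hypothetical raw per-column scores: nine low, one modest, one high.
raw = [1.0] * 9 + [2.0, 11.0]
normalized = normalize_scores(raw)
highlighted = flag_outliers(normalized)
```

With these made-up scores only the last column exceeds the cutoff, so it alone would be highlighted. A glycine frequency of 0.65 would be reported as ‘6’ under `tenths_code`, matching the 60–70% example in the legend.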
Fig 4. Correlated residue patterns associated with the node 12 lineage.
See the legend to Fig 3B for an explanation of notation. The same representative node-12-assigned sequences are shown in both A and B, with highlighting of the pattern residues most distinctive of the root (i.e., the superfamily) and of the node 12 subgroup, respectively. A. Contrast alignment corresponding to the root of the acetylase hierarchy. B. Contrast alignment corresponding to node 12 of the hierarchy. Here the foreground corresponds to sequences assigned to node-12 and the background to all other acetylase sequences. The highlighted columns below the red dots correspond to the residues shown in Fig 5B; note that the constraints imposed on these residues (i.e., the heights of the red bars above the dots) are generally higher than the constraints imposed on the other pattern residues.
Fig 5. The Caenorhabditis elegans glucosamine-6-phosphate N-acetyltransferase (Gna1) complexed with CoA and N-acetylglucosamine-6-phosphate (GlcNAc6p) (pdb_id: 4ag9) [84].
Gna1 was assigned to node 12 of the hierarchy. A. Structural locations of acetylase residues (yellow) and node 12-specific residues (red). B. Node 12-specific residues involved in substrate binding. These residue positions are indicated in Fig 4 (as red dots above column positions). C. Node 12-specific residues associated with the homodimeric interface.
Fig 6. Pattern residues associated with the node 42 lineage.
A. The structural locations of pattern residues corresponding to nodes 35, 40 and 42 (the sidechains of which are shown in red, orange and yellow, respectively) of the apo form of a putative acetylase from Salmonella typhimurium (pdb_id: 3dr6). This protein forms a homodimer, the two subunits of which are shown in blue and gray. Residues are shown within the bottom subunit of the homodimeric complex only. B. Pattern residues associated with interactions between dimeric subunits in the apo form (pdb_id: 3dr6). C. Pattern residues that line the substrate binding pocket within the CoA-bound form of the same acetylase as in (A) (pdb_id: 3dr8). CoA is shown in cyan. D. The corresponding surface plot of the pocket using the same color scheme.
Fig 7. Node 35 structural features implicated in a proposed induced-fit mechanism.
A-C. Conformational changes that involve two conserved residues and that mediate opening and closing of the substrate binding pocket of a putative acetylase from Salmonella typhimurium with and without bound acetyl-CoA (pdb_id: 3dr8 and 3dr6). A. (top) Surface view of the open conformation when CoA is bound to both subunits (pdb_id: 3dr8). The surface of Glu82 is shown in red, of Arg72 in yellow and of the rest of the substrate binding pocket (SBP) in green. (bottom) Close-up view of Glu82 and Arg72 at the dimeric interface (pdb_id: 3dr8); the SBP is indicated. B. (top) Surface view of the closed conformation when neither subunit is bound to CoA (pdb_id: 3dr6). Note that the substrate binding pocket appears inaccessible. (bottom) Close-up of the Glu82-Arg72 salt bridge formed at the subunit interface. C. (top left) Surface side view of the acetyl-CoA bound form (pdb_id: 3dr8) showing the locations of a cluster of Set42-specific residues (shaded yellow). CoA is shown in cyan. (bottom) The same view of 3dr8 as in A but rotated by 90 degrees to show a side view. The expanded box shows the node-42 pattern residue interactions forming a bridge between adjacent loop regions. (top right) A similar view of C. elegans Gna1 (pdb_id: 4ag9) showing the adjacent locations of the CoA and substrate within a channel rather than a pocket as in (A). D-F. Differences between node 40 and node 36 pattern residues. Residues with red sidechains correspond to node 35 pattern residues. See S3 Fig for the contrast alignment showing pattern residues. D. Node 40 pattern residues (orange sidechains) within 3dr6 (an acetylase assigned to node 42). E. Node 40 pattern residues (orange sidechains) within 1vhs (an acetylase assigned to node 41). F. Node 36 pattern residues (light green sidechains) within 4jxr (an acetylase assigned to node 39).
Fig 8. Opening and closing of the substrate binding pocket and movement of helix 2 based on superpositioning of bound and unbound acetylase domains.
Sidechains of the putative induced-fit glutamate and arginine residues in the closed form are shown as red and yellow sticks, respectively. CoA within half-bound homodimers is shown as cyan-colored sticks. A. Node 42 CoA-bound, CoA-unbound and CoA-half-bound structures. When bound to CoA, helix 2 moves outward relative to the unbound homodimer. The sidechain of the 3dr8 arginine residues (open conformation) for the 3dr8 vs 3dr6 superposition is shown as light blue sticks (compare with Fig 7A). B. Node 35 hierarchy with color coding. C. Node 41 unbound monomeric structure superimposed over the node 42 CoA-bound homodimer. Note that the proposed induced-fit arginine residue is replaced by a tyrosine, which could also form both a hydrogen bond to the glutamate residue and a π-π stacking interaction with the corresponding tyrosine from the other homodimeric subunit. D. Node 39 superpositions showing that, unlike sequences assigned to the node 40 subtree, helix 2 does not appear to move outward upon binding to CoA. This may be due to differences between node 40 and node 36 pattern residues associated with this helix (see Fig 7D–7F).

References

    1. Mendel G. Versuche über Pflanzen Hybriden. Verhandlungen des Naturforschenden Vereines Brünn. 1866;4:3–47.
    2. Arnesen T, Anderson D, Baldersheim C, Lanotte M, Varhaug JE, Lillehaug JR. Identification and characterization of the human ARD1-NATH protein acetyltransferase complex. Biochem J. 2005;386(Pt 3):433–43. doi: 10.1042/BJ20041071. PMCID: PMC1134861.
    3. Parliament MB. Radiogenomics: associations in all the wrong places? Lancet Oncol. 2012;13(1):7–8. doi: 10.1016/S1470-2045(11)70331-X.
    4. Ioannidis JP. Why most published research findings are false. PLoS Med. 2005;2(8):e124. doi: 10.1371/journal.pmed.0020124. PMCID: PMC1182327.
    5. Hayat S, Sander C, Marks DS, Elofsson A. All-atom 3D structure prediction of transmembrane beta-barrel proteins from sequences. Proc Natl Acad Sci U S A. 2015;112(17):5413–8. doi: 10.1073/pnas.1419956112. PMCID: PMC4418893.

Grants and funding

SFA was supported by the Intramural Research Program of the National Institutes of Health, National Library of Medicine. AFN received no specific funding for this work. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.