Protein Sci. 2002 May;11(5):1101-16.
doi: 10.1110/ps.3950102.

Structural Similarity to Link Sequence Space: New Potential Superfamilies and Implications for Structural Genomics

Free PMC article

Patrick Aloy et al. Protein Sci.

Abstract

The current pace of structural biology now means that protein three-dimensional structure can be known before protein function, making methods for assigning homology via structure comparison of growing importance. Previous research has suggested that sequence similarity after structure-based alignment is one of the best discriminators of homology and often functional similarity. Here, we exploit this observation, together with a merger of protein structure and sequence databases, to predict distant homologous relationships. We use the Structural Classification of Proteins (SCOP) database to link sequence alignments from the SMART and Pfam databases. We thus provide new alignments that could not be constructed easily in the absence of known three-dimensional structures. We then extend the method of Murzin (1993b) to assign statistical significance to sequence identities found after structural alignment and thus suggest the best link between diverse sequence families. We find that several distantly related protein sequence families can be linked with confidence, showing the approach to be a means for inferring homologous relationships and thus possible functions when proteins are of known structure but of unknown function. The analysis also finds several new potential superfamilies, where inspection of the associated alignments and superimpositions reveals conservation of unusual structural features or co-location of conserved amino acids and bound substrates. We discuss implications for Structural Genomics initiatives and for improvements to sequence comparison methods.
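As a rough illustration of what it means to attach a significance to identities counted after structural alignment, the sketch below computes a plain binomial tail probability. This is not the P3D statistic of the paper, nor Murzin's (1993b) method, whose details are not given in this abstract. The function name `binomial_tail` and the uniform 20-letter background (`p_match = 0.05`) are assumptions made purely for illustration.

```python
from math import comb


def binomial_tail(identities: int, aligned: int, p_match: float) -> float:
    """Return P(X >= identities) for X ~ Binomial(aligned, p_match).

    Illustrative only: assumes every structurally aligned position matches
    by chance independently with the same probability p_match.
    """
    if not 0 <= identities <= aligned:
        raise ValueError("identities must lie between 0 and aligned")
    if not 0.0 <= p_match <= 1.0:
        raise ValueError("p_match must be a probability")
    # Sum the binomial probabilities for k = identities, ..., aligned.
    return sum(
        comb(aligned, k) * p_match**k * (1.0 - p_match) ** (aligned - k)
        for k in range(identities, aligned + 1)
    )


if __name__ == "__main__":
    # Hypothetical input: 13 identities in 32 equivalent residues, with an
    # assumed uniform background over 20 amino acids (p_match = 0.05).
    print(binomial_tail(13, 32, 0.05))
```

A fixed, position-independent background is the crudest possible null model. It ignores amino-acid composition and structural context, so values from this sketch should not be compared with the P3D or MP3D values reported in the figure legends.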

Figures

Fig. 1.
Significant links (P3D ≤ 5×10⁻³) at fold level between SMART domains identified by the method discussed in the text. Thick continuous lines indicate MP3D ≤ 5×10⁻³ and dashed lines MP3D > 5×10⁻³.
Fig. 2.
(a) Molscript (Kraulis 1991) figures showing Staphylococcus aureus nuclease (staphylococcal nuclease; left; PDB code 1kdc) and Escherichia coli S1 RNA-binding protein (right; 1sro) in a similar orientation. Structurally equivalent regions (identified by the method of Russell and Barton 1992) are shown as arrows (β-strands), ribbons (α-helices), or coils, with nonequivalent regions shown as Cα trace. Residues common to both structures are shown in ball-and-stick format. The coloring scheme moves through the spectrum from blue to red from N- to C-terminus. Linkage details: RMSD = 2.0 Å over 38 Cα atoms; 13 identities in 32 equivalent residues; P3D = 4.7×10⁻⁶; MP3D = 8.2×10⁻⁵. (b) Alscript (Barton 1993) figure showing the structural alignment of the two structures in (a), with secondary structures below. The best linking sequence is also shown for the nucleases (NUC_SFLX; S. flexneri nuclease); the best link was between this sequence and the E. coli S1 RNA-binding domain. Positions within the alignment showing conservation of residue character are colored as follows: yellow background, conserved hydrophobic; blue background, conserved small; red text, conserved polar. Identical residues are boxed. Secondary structures are shown as arrows (β-strands) and cylinders (α-helices) below the alignment and colored as in (a).
Fig. 3.
Significant links at fold level between Pfam domains identified by the method discussed in the text. Lines are drawn as for Figure 1.
Fig. 4.
(a) Molscript (Kraulis 1991) figures showing the poly(A)-binding protein (left; 1cvj, chain F) and the copper transporter ATPase (right; 1aw0) in a similar orientation. Details for the figures are as for Figure 2. Linkage details: RMSD = 1.8 Å over 39 Cα atoms; seven identities in 12 equivalent residues; P3D = 4.7×10⁻⁵; MP3D = 9.1×10⁻⁴. (b) Alscript (Barton 1993) figure showing the structural alignment of the two structures in (a), with secondary structures below. The best linking sequences are also shown (COPA_ENTHR and RO33_NICSY). Conserved positions and secondary structures are shown as described in Figure 2.
Fig. 5.
(a) Molscript (Kraulis 1991) figures showing the N-terminus of transit peptide H protein of the Gly cleavage system (left; 1hpc, chain a) and the C-terminus of glucose permease domain IIA (right; 1gpr) in a similar orientation. Details for the figures are as for Figure 2. Linkage details: RMSD = 1.4 Å over 45 Cα atoms; 15 identities in 45 equivalent residues; P3D = 1.3×10⁻⁴; MP3D = 2.5×10⁻³. (b) Alscript (Barton 1993) figure showing the structural alignment of the two structures in (a), with secondary structures below. The best linking sequences are also shown (O59049 and PTBA_ERWCH). Conserved positions and secondary structures are shown as described in Figure 2. The numbers within the alignment denote the start and end of the aligned segments (note, in particular, that 1hpc is permuted relative to 1gpr). (c) Topology diagram showing the similarity. β-strands are denoted as triangles and α-helices as circles, colored as in (a).
