Protein Sci. 2005 May;14(5):1305-14.
doi: 10.1110/ps.041187405.

Structural Similarity to Bridge Sequence Space: Finding New Families on the Bridges

Free PMC article

Parantu K Shah et al. Protein Sci.

Abstract

The number of structures for protein domains has increased rapidly in recent years owing to advances in structural biology and structural genomics projects. New structures are often similar to those solved previously, and such similarities can give insights into function by linking poorly understood families to those that are better characterized. They also allow the possibility of combining information to find still more proteins adopting a similar structure and sometimes a similar function, and to reprioritize families in structural genomics pipelines. We explore this possibility here by preparing merged profiles for pairs of structurally similar, but not necessarily sequence-similar, domains within the SMART and Pfam databases by way of the Structural Classification of Proteins (SCOP). We show that such profiles are often able to identify further members of the same superfamily and thus can be used to increase the sensitivity of database searching methods like HMMer and PSI-BLAST. We perform detailed benchmarks using the SMART and Pfam databases with four complete genomes frequently used as annotation benchmarks. We quantify the associated increase in structural information in Swissprot and discuss examples illustrating the applicability of this approach to understanding functional and evolutionary relationships between protein families.

Figures

Figure 1.
Merging sequences in fold space to find the bridging families. (A) Representation of fold space where related sequences are grouped into families (or higher-level groupings). Families related by structure (and assumed evolutionary relationship) can be merged to produce a powerful profile, which in turn can be used to find the sequences that occur at the “bridges.” Ovals of different colors represent different sequence families that populate fold space. Proteins of known structure are shown as stars; those without, as circles. (B) The strategy that we have used to identify bridging families. We use sequence information from SMART/Pfam and the structural hierarchy of SCOP to merge different related families into new superfamilies. We then use PSI-BLAST or HMMer profiles of the merged superfamily to search the sequence databases and identify the bridging families that may be part of the same superfamily.
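The pairing step in panel (B) can be illustrated with a minimal sketch. It assumes each SMART/Pfam family has already been mapped to a SCOP grouping; the family and group labels (`famA`, `sf1`, and so on) and the function name `candidate_merges` are hypothetical, and the sketch only enumerates candidate pairs rather than building or searching any profile.

```python
from collections import defaultdict
from itertools import combinations


def candidate_merges(family_to_group):
    """Return (group, family_a, family_b) for every pair of families
    that share the same structural group (hypothetical labels)."""
    groups = defaultdict(list)
    for family, group in family_to_group.items():
        groups[group].append(family)

    pairs = []
    for group, families in sorted(groups.items()):
        # Every unordered pair within a group is a candidate merged profile.
        for fam_a, fam_b in combinations(sorted(families), 2):
            pairs.append((group, fam_a, fam_b))
    return pairs


# Toy mapping, not real SMART/Pfam or SCOP identifiers.
example = {"famA": "sf1", "famB": "sf1", "famC": "sf1", "famD": "sf2"}
merge_pairs = candidate_merges(example)
```

A group with a single family (here `sf2`) yields no pairs, since there is nothing to merge it with.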
Figure 2.
Benchmarks with SMART and Pfam. Plots of Specificity (upper left) and Sensitivity (lower left) vs. BLAST/HMMer E-value (log scale). Labels indicate the type of profiles used: “Fold,” Pfam/SMART domains merged with structures sharing the same fold but lying in different superfamilies; “Superfamily,” structures in the same superfamily; “Unmerged,” separate Pfam/SMART domains. The vertical broken lines show the thresholds for HMMer (blue) and BLAST (red) chosen in the text to give the optimal results when searching. The ROC curve (right) is shown with the same labels. The Specificity curves for Unmerged profiles run horizontally along the top of the plot, and the corresponding ROC curves run along the side, so they may not be easily visible.
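As an illustration of how curves like these can be derived from search results, the sketch below computes sensitivity and specificity at a few E-value thresholds. It assumes the standard definitions, sensitivity = TP/(TP+FN) and specificity = TN/(TN+FP); the exact definitions used for the figure are not stated in this section, and the hit list is toy data.

```python
def sensitivity_specificity(hits, threshold):
    """hits: list of (e_value, is_true_member) pairs.
    A hit is accepted when its E-value is at or below the threshold."""
    tp = sum(1 for e, member in hits if e <= threshold and member)
    fn = sum(1 for e, member in hits if e > threshold and member)
    fp = sum(1 for e, member in hits if e <= threshold and not member)
    tn = sum(1 for e, member in hits if e > threshold and not member)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Toy hits: (E-value, whether the hit truly belongs to the superfamily).
toy_hits = [
    (1e-30, True), (1e-8, True), (1e-3, False),
    (0.5, True), (2.0, False), (10.0, False),
]

# One (threshold, sensitivity, specificity) point per threshold.
curve = [(t, *sensitivity_specificity(toy_hits, t)) for t in (1e-10, 1e-2, 1.0)]
```

Loosening the threshold can only keep or raise sensitivity and keep or lower specificity, which is the trade-off the vertical threshold lines in the figure are chosen to balance.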
Figure 3.
Examples of similarities found using merged profiles. (A) A typical example from our profile searches using mergers of SMART domains of the Helix-Turn-Helix fold. Domains in white boxes are from the starting sets. The solid black bars represent merged families. Domains in gray ovals are structurally uncharacterized and those in white ovals are structurally characterized (but not included in the starting set). The gray dotted lines starting from the black bars represent the merged pair that brings out the “bridging” family. The relationships described here form a complex web in which the same resultant family may be picked up by more than one profile. (B) A similar figure for mergers of different Pfam families of the Methyltransferase fold.
Figure 4.
Assignment and benchmarking on complete genomes. (A) Difference in the number of fold assignments made using HMMer searches with merged profiles, compared with assignments of the same fold by methods such as SUPERFAMILY, BLAST, and AnDom, on the full genomes of M. genitalium, E. coli K12, S. pneumoniae R6, and S. cerevisiae. (B) The assignment differences obtained with our profiles when searching the genomes with PSI-BLAST.
Figure 5.
Current level of structural annotation of sequence databases. Progression (left to right) showing how the number of links to known structure in Swissprot increases as more sensitive methods are used. “PDB” shows proteins of known structure; “HSSP” augments these with their close homologs; “SMART + Pfam” are links added by matches to domains that are themselves linked to structures; “PB or Andom” increases this further via PSI-BLAST or AnDom assignments; “Unique” are those assigned by our seven new domains found with merged profiles (Table 2).
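The left-to-right progression can be read as a running union of protein sets, one per method. The sketch below is a minimal illustration under that assumption; the protein identifiers are invented and the counts bear no relation to the figure's actual values.

```python
def cumulative_links(stages):
    """stages: ordered list of (label, set_of_protein_ids).
    Returns (label, running_total) where the total counts each protein once."""
    seen = set()
    totals = []
    for label, proteins in stages:
        seen |= set(proteins)  # proteins already linked are not counted again
        totals.append((label, len(seen)))
    return totals


# Toy identifiers only; stage labels follow the figure caption.
toy_stages = [
    ("PDB", {"P1", "P2"}),
    ("HSSP", {"P2", "P3"}),
    ("SMART + Pfam", {"P3", "P4", "P5"}),
    ("PB or Andom", {"P5", "P6"}),
    ("Unique", {"P7"}),
]
totals = cumulative_links(toy_stages)
```

Counting via a union means each stage contributes only the proteins that no earlier, less sensitive method had already linked to a structure.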
