Bioinformatics. 2019 Jan 1;35(1):12-19.
doi: 10.1093/bioinformatics/bty523.

The Choice of Sequence Homologs Included in Multiple Sequence Alignments Has a Dramatic Impact on Evolutionary Conservation Analysis

Free PMC article

Nelson Gil et al. Bioinformatics. 2019.

Abstract

Motivation: The analysis of sequence conservation patterns has been widely utilized to identify functionally important (catalytic and ligand-binding) protein residues for over a half-century. Despite decades of development, on average state-of-the-art non-template-based functional residue prediction methods must predict ∼25% of a protein's total residues to correctly identify half of the protein's functional site residues. The overwhelming proportion of false positives results in reported 'F-Scores' of ∼0.3. We investigated the limits of current approaches, focusing on the so-far neglected impact of the specific choice of homologs included in multiple sequence alignments (MSAs).
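To make the numbers above concrete, here is a minimal sketch of how an F-Score in that range can arise. It assumes the F-Score is the usual harmonic mean of precision and recall (F1), which the abstract does not spell out, and it uses a hypothetical protein of 200 residues with 20 annotated functional residues; none of these figures come from the paper.

```python
def f_score(true_positives: int, predicted: int, functional: int) -> float:
    """Harmonic mean of precision and recall (assumed F1 definition)."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / functional
    return 2 * precision * recall / (precision + recall)


# Hypothetical protein: 200 residues, 20 of them annotated as functional.
# Predicting 25% of all residues (50) and recovering half of the functional
# ones (10) leaves 40 false positives.
score = f_score(true_positives=10, predicted=50, functional=20)
print(round(score, 2))  # 0.29
```

With precision 0.2 and recall 0.5, the score lands near the ∼0.3 quoted above, which illustrates how false positives dominate the result.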

Results: The limits of conservation-based functional residue prediction were explored by surveying the binding sites of 1023 proteins. A straightforward conservation analysis of MSAs composed of randomly selected homologs sampled from a PSI-BLAST search achieves average F-Scores of ∼0.3, a performance matching that reported by state-of-the-art methods, which often consider additional features for the prediction in a machine learning setting. Interestingly, we found that a simple combinatorial MSA sampling algorithm will in almost every case produce an MSA with an optimal set of homologs whose conservation analysis reaches average F-Scores of ∼0.6, doubling state-of-the-art performance. We also show that this is nearly at the theoretical limit of possible performance given the agreement between different binding site definitions. Additionally, we showcase the progress in this direction made by Selection of Alignment by Maximal Mutual Information (SAMMI), an information-theory-based approach to identifying biologically informative MSAs. This work highlights the importance and the unused potential of optimally composed MSAs for conservation analysis.
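The idea of sampling homolog subsets and scoring each resulting MSA can be sketched as follows. This is an illustration under stated assumptions, not the authors' algorithm: conservation is measured here by normalized Shannon entropy per column, subsets are drawn uniformly at random, and all function names and the toy sequences are hypothetical. Sequences are assumed to be pre-aligned to the query, with `-` marking gaps.

```python
import math
import random
from collections import Counter


def column_conservation(column):
    """1 minus normalized Shannon entropy of one alignment column (gaps ignored)."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return 1.0 - entropy / math.log2(20)  # 20 standard amino acids


def predict_sites(msa, top_k):
    """Indices of the top_k most conserved columns (ties go to the lower index)."""
    scores = [column_conservation(col) for col in zip(*msa)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:top_k])


def f_score(predicted, annotated):
    """F1 of a predicted residue set against the annotated set."""
    tp = len(predicted & annotated)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(annotated)
    return 2 * precision * recall / (precision + recall)


def best_sampled_msa(query, homologs, annotated, subset_size, n_samples, seed=0):
    """Sample homolog subsets and keep the one whose conservation F-score is highest."""
    rng = random.Random(seed)
    best_score, best_subset = -1.0, None
    for _ in range(n_samples):
        subset = rng.sample(homologs, subset_size)
        predicted = predict_sites([query] + subset, len(annotated))
        score = f_score(predicted, annotated)
        if score > best_score:
            best_score, best_subset = score, subset
    return best_score, best_subset


# Toy, pre-aligned sequences (hypothetical data).
query = "ACDEFGHK"
homologs = ["ACDQFGHR", "SCDEYGHK", "TCNEFAHK", "ACDEFGHK", "VCEDWGHR"]
annotated = {1, 6}  # hypothetical binding-site columns
best_score, best_subset = best_sampled_msa(query, homologs, annotated, 3, 20)
```

Because this sketch picks the best subset by comparing against the annotated sites, it behaves as an upper-bound estimate rather than a practical predictor; choosing a good MSA without annotations is the problem an approach such as SAMMI addresses.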

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
Agreement between functional sites defined using different methods. This plot shows the agreement between BioLiP and two different approaches, FireDB (red line) and Ligand Protein Contacts (LPC; blue line, solid), in defining ligand-binding site residues for miscellaneous-molecule-binding proteins. Also shown is the agreement between BioLiP and Contacts of Structural Units (CSU; blue line, dashed) in defining the ligand-binding residues of oligopeptide-binding proteins
Fig. 2.
Sequence conservation-based binding site residue predictions can approach the practical upper limit of binding site definition. Conservation analysis F-Score distributions for the three datasets used in this work (A: miscellaneous-molecule-binding, B: oligopeptide-binding, C: oligonucleotide-binding) are plotted for the Max Sampled MSAs (blue), SAMMI MSAs (green) and randomly selected MSAs (red). Specifically, the red curve was obtained by plotting the distribution of average F-Scores resulting from conservation analysis of 100 randomly selected sampled MSAs for each protein. In addition, a background F-Score distribution (black) representing the likelihood of selecting functional residues by chance was obtained by randomly selecting a number of residues (with relative solvent-accessibility greater than zero) equal to the number of annotated functional residues for each protein and averaging the resulting F-Score over 264 trials. The dashed lines at F-Scores of ∼0.8 indicate the approximate theoretical upper limit that was established by analysis of the agreement between different databases. The plotted density functions are accurate representations of the underlying data (Supplementary Fig. 2)
Fig. 3.
Modern functional residue prediction methods are equivalent to conservation analysis on a randomly selected MSA. Reported F-Scores of representative state-of-the-art methods for predicting enzyme catalytic and ligand-binding residues, overlain on the F-Score distributions obtained for the miscellaneous-molecule-binding dataset in this work. The mean reported F-Scores of methods 2, 3, 4 and 5 are all between 0.25 and 0.30, while those of methods 6, 7 and 8 are ∼0.35. Although the performance of each method was calculated on a different benchmark dataset, the methods exhibit similar results and appear roughly equivalent to using conservation analysis on a randomly selected MSA. All methods involve statistical analysis of sequence and/or structural features, mostly by using machine learning approaches. Brief descriptions of the methods and their reported performances are listed in Table 1
Fig. 4.
Statistical significance of SAMMI MSA F-Scores obtained for the set of 275 misc.-molecule-binding proteins. Each point represents the P-value obtained for an individual protein’s SAMMI F-Score. The plot shows that P-values drop sharply after the F-Score reaches 0.20. The inset shows the negative base 10 logarithm of the P-values and demonstrates a corresponding monotonically rising trend. The cluster of P-values at F-Scores of 0 indicates the statistical significance of these predictions as well; for these cases, it is likely that the SAMMI MSA conservation pattern detects a secondary binding site or unannotated functional residues. The dashed lines indicate P-values of 0.05 and 0.01 (-log10 applied in inset). Similar plots were obtained for the oligopeptide-binding and oligonucleotide-binding datasets. Plotting the P-values of random MSA selection F-Scores would also show a similar drop at an F-Score value of 0.25
Fig. 5.
Example of functional residue prediction of CD38 extracellular domain (PDB 3dzhA; gray cartoon) using an ‘average-performing’ MSA representing random MSA selection (A; F-Score = 0.3), the SAMMI-selected MSA (B; F-Score = 0.5) and the Max Sampled MSA (C; F-Score = 0.6). Balls represent residues that either are BioLiP-annotated to bind or are predicted to bind to the GTP (blue stick model), an inhibitory ligand of CD38. True positive, false negative and false positive residue atoms are respectively represented by green, yellow and red balls
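The chance-level background distribution described in the Fig. 2 caption can be approximated with a short simulation. The sketch below is an assumption-laden illustration rather than the authors' code: it takes a hypothetical list of solvent-exposed residue indices, picks as many residues at random as there are annotated functional residues, and averages the F-Score over a chosen number of trials. The function name, the toy indices and the trial count of 500 are arbitrary choices for this sketch.

```python
import random


def background_f_score(exposed, annotated, n_trials, seed=0):
    """Average F-score of choosing len(annotated) exposed residues at random."""
    rng = random.Random(seed)
    k = len(annotated)
    total = 0.0
    for _ in range(n_trials):
        picked = set(rng.sample(exposed, k))
        true_positives = len(picked & annotated)
        # With equally sized predicted and annotated sets, precision equals
        # recall, so the F-score reduces to true_positives / k.
        total += true_positives / k
    return total / n_trials


# Hypothetical protein: 100 solvent-exposed residues, 10 annotated as functional.
exposed = list(range(100))
annotated = set(range(10))
baseline = background_f_score(exposed, annotated, n_trials=500)
```

In expectation the result equals the fraction of exposed residues that are annotated as functional (0.1 in this toy case), which gives a sense of how far above chance a given F-Score sits.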
