BMC Genomics, 14 Suppl 3 (Suppl 3), S4

The SAAP Pipeline and Database: Tools to Analyze the Impact and Predict the Pathogenicity of Mutations

Nouf S Al-Numair et al. BMC Genomics.

Abstract

Background: Understanding and predicting the effects of mutations on protein structure and phenotype is an increasingly important area. Genes for many genetically linked diseases are now routinely sequenced in the clinic. Previously we focused on understanding the structural effects of mutations, creating the SAAPdb resource.

Results: We have updated SAAPdb to include 41% more SNPs and 36% more PDs. Introducing a hydrophobic residue on the surface, or a hydrophilic residue in the core, no longer shows significant differences between SNPs and PDs. We have improved some of the analyses, significantly enhancing the analysis of clashes and of mutations to proline and from glycine. A new web interface has been developed, allowing users to analyze their own mutations. Finally, we have developed a machine learning method that gives a cross-validated accuracy of 0.846, considerably outperforming well-known methods, including SIFT and PolyPhen2, which give accuracies between 0.690 and 0.785.

Conclusions: We have updated SAAPdb and improved its analyses but, given the increasing rate at which mutation data are generated, we have also created a new analysis pipeline and web interface. Machine learning that uses the structural analysis results to predict pathogenicity considerably outperforms other methods.

Figures

Figure 1
PQS: residue in contact with a different protein chain or ligand, according to the PQS server; Bind: residue in contact with a different protein chain or ligand, according to PDB data; MMDB: residue in contact with a ligand, according to the MMDB server; Gly: mutation from glycine, introducing unfavourable torsion angles; Pro: mutation to proline, introducing unfavourable torsion angles; Cispro: mutation from cis-proline, introducing unfavourable torsion angles; Clash: mutation introducing a steric clash with an existing residue; Void: mutation introducing a destabilizing void >275 Å³ in the protein core; Hbond: mutation disrupting a hydrogen bond; CorePhilic: introduction of a hydrophilic residue in the protein core; SurfacePhobic: introduction of a hydrophobic residue on the protein surface; BuriedCharge: mutation causing an unsatisfied charge in the protein core; SSgeom: mutation disrupting a disulphide bond; HighCons: residue has conserved sequence; EXPLAINED: any of the above categories. In addition, we look at whether a residue is annotated as functionally relevant by UniProtKB/SwissProt. Asterisks indicate a significant result (two where p < 0.01 and one where p < 0.05), calculated as described in Hurst et al. [19].
Figure 2
Schematic indicating the two new terms used in evaluation of clashes. E_vdW is the van der Waals energy evaluated using a standard Lennard-Jones potential, while E_ψ is a torsion energy.
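As an illustration of the van der Waals term only, the following is a minimal sketch of the common 12-6 form of the Lennard-Jones potential. The paper's Equation 1 and its parameterisation are not reproduced in this record, so the function name, the `epsilon` (well depth) and `sigma` (zero-crossing distance) arguments, and the numeric values below are assumptions chosen for the example; the torsion term is not modelled.

```python
def lennard_jones(r, epsilon, sigma):
    """12-6 Lennard-Jones energy for two atoms separated by distance r.

    epsilon (well depth) and sigma (zero-crossing distance) are
    illustrative inputs, not parameters taken from the paper.
    """
    if r <= 0:
        raise ValueError("distance must be positive")
    ratio6 = (sigma / r) ** 6
    return 4.0 * epsilon * (ratio6 * ratio6 - ratio6)


# The minimum sits at r = 2**(1/6) * sigma, where the energy equals -epsilon.
r_min = 2 ** (1 / 6) * 3.4
print(lennard_jones(r_min, epsilon=0.2, sigma=3.4))
```

Distances shorter than `sigma` give a positive (repulsive) energy that rises steeply, which is the behaviour a clash analysis would be sensitive to.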
Figure 3
Distribution of sidechain clash energies calculated according to Equation 1 for high resolution structures amongst CATH O-representatives.
Figure 4
Distribution of energies calculated according to Equation 1 for sidechain replacements classified as making 0-5 clashes using the old (Boolean) method. In the old method, 0, 1, or 2 clashes were considered not to form a bad clash while 3 or more clashes were considered to be a bad clash. In each plot, the shaded area shows those residues that were mis-classified according to the new energy-based criterion.
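The comparison described in this caption can be sketched as follows. Only the old rule (3 or more clashes is a bad clash) comes from the caption. The energy cutoff used by the new criterion is not given in this record, so `threshold` is a hypothetical caller-supplied value, and treating higher energy as worse is an assumption.

```python
def bad_clash_boolean(n_clashes):
    """Old rule from the caption: 3 or more clashes count as a bad clash."""
    return n_clashes >= 3


def bad_clash_energy(energy, threshold):
    """Energy-based rule; `threshold` is a hypothetical cutoff (assumption)."""
    return energy > threshold


def misclassified(n_clashes, energy, threshold):
    """True when the old Boolean call disagrees with the energy-based call."""
    return bad_clash_boolean(n_clashes) != bad_clash_energy(energy, threshold)


# Example: two clashes pass the old rule, but a high energy flags them anyway.
print(misclassified(n_clashes=2, energy=50.0, threshold=10.0))
```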
Figure 5
Ramachandran plots generated from high resolution structures. a) glycine, b) proline, c) other. Favoured regions are shown in progressively paler green while disfavoured regions are shown in red.
Figure 6
Results pages from the new SAAPdap pipeline. a) Summary and brief structural reports -- hovering over any of the titles brings up a box to explain the meaning of the effect; b) Expanded view of full structural analysis.
Figure 7
The penetrance of a mutation lies on a scale between 'True SNPs' which show no phenotypic effect at one extreme to Mendelianly inherited PDs with 100% penetrance at the other. In SAAPdb, we use a very conservative definition of PDs, but a rather wide definition of SNPs. In contrast, HumVar uses a somewhat broader definition of PDs, but a much more conservative definition of SNPs and does not consider mutations that lie in the middle.
Figure 8
Performance of the machine learning method trained on different-sized sets of data from SAAPdb. In each case, a balanced dataset of the required size was extracted at random from the SAAPdb dataset of mutations mapped to protein chains (Table 2) and random forests were trained and tested using 10-fold cross-validation. Performance drops as the dataset size decreases, with a marked drop for datasets below 10,000 samples in size (5,000 SNPs and 5,000 PDs).
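The sampling and evaluation protocol in this caption can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the authors' code: the function names, the placeholder identifiers, and the interleaved fold assignment are all hypothetical, and the random forest itself is omitted, so only the balanced draw and the 10-fold partition are shown.

```python
import random


def balanced_sample(snps, pds, size, rng):
    """Draw size/2 SNPs and size/2 PDs at random, without replacement."""
    half = size // 2
    sample = [(x, 0) for x in rng.sample(snps, half)]   # label 0 = SNP
    sample += [(x, 1) for x in rng.sample(pds, half)]   # label 1 = PD
    rng.shuffle(sample)
    return sample


def k_fold_splits(n_items, k=10):
    """Yield (train_indices, test_indices) for each of k folds."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


rng = random.Random(0)
snps = [f"snp{i}" for i in range(200)]   # placeholder identifiers
pds = [f"pd{i}" for i in range(150)]     # placeholder identifiers
data = balanced_sample(snps, pds, size=100, rng=rng)
splits = list(k_fold_splits(len(data), k=10))
```

Each `(train, test)` pair would then be used to fit a classifier on the training indices and score it on the held-out fold, with accuracy averaged over the ten folds.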

References

    1. Carr SM, Marshall HD, Duggan AT, Flynn SMC, Johnstone KA, Pope AM, Wilkerson CD. Phylogeographic Genomics of Mitochondrial DNA: Highly-resolved Patterns of Intraspecific Evolution and a Multi-species, Microarray-based DNA Sequencing Strategy for Biodiversity Studies. Comp Biochem Physiol Part D Genomics Proteomics. 2008;3:1–11. doi: 10.1016/j.cbd.2006.12.005.
    2. Bentley DR. Whole-genome Re-sequencing. Curr Opin Genet Dev. 2006;16:545–552. doi: 10.1016/j.gde.2006.10.009.
    3. Wang P, Dai M, Xuan W, McEachin RC, Jackson AU, Scott LJ, Athey B, Watson SJ, Meng F. SNP Function Portal: a web Database for Exploring the Function Implication of SNP Alleles. Bioinformatics. 2006;22:e523–e529. doi: 10.1093/bioinformatics/btl241.
    4. Yue P, Melamud E, Moult J. SNPs3D: Candidate gene and SNP Selection for Association Studies. BMC Bioinformatics. 2006;7:166–166. doi: 10.1186/1471-2105-7-166.
    5. Uzun A, Leslin CM, Abyzov A, Ilyin V. Structure SNP (StSNP): a web Server for Mapping and Modeling nsSNPs on Protein Structures with Linkage to Metabolic Pathways. Nucleic Acids Res. 2007;35:W384–W392. doi: 10.1093/nar/gkm232.
