Lineage-specific mutational clustering in protein structures predicts evolutionary shifts in function

Bioinformatics. 2017 May 1;33(9):1338-1345. doi: 10.1093/bioinformatics/btw815.


Motivation: Spatially clustered mutations within specific regions of protein structure are thought to result from strong positive selection for altered protein functions and are a common feature of oncoproteins in cancer. Although previous studies have used spatial substitution clustering to identify positive selection between pairs of proteins, the ability of this approach to identify functional shifts in protein phylogenies has not been explored.

Results: We implemented a previous measure of spatial substitution clustering (the P3D statistic) and extended it to detect spatially clustered substitutions at specific branches of phylogenetic trees. We then applied the analysis to 423 690 phylogenetic branches from 9261 vertebrate protein families, and examined its ability to detect historical shifts in protein function. Our analysis identified 19 607 lineages from 5362 protein families in which substitutions were spatially clustered on protein structures at P3D < 0.01. Spatially clustered substitutions were overrepresented among ligand-binding residues and were significantly enriched among particular protein families and functions including C2H2 transcription factors and protein kinases. A small but significant proportion of branches with spatially clustered substitutions were also under positive selection according to the branch-site test. Lastly, exploration of the top-scoring candidates revealed historical substitution events in vertebrate protein families that have generated new functions and protein interactions, including ancient adaptations in SLC7A2, PTEN, and SNAP25. Ultimately, our work shows that lineage-specific, spatially clustered substitutions are a useful feature for identifying functional shifts in protein families, and reveals new candidates for future experimental study.
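
The idea of testing whether the substitutions on one branch are spatially clustered can be illustrated with a minimal sketch. The abstract does not define the P3D statistic, so the code below is not the authors' method: it assumes a generic permutation test that compares the mean pairwise 3D distance among substituted residues with that of random residue sets of the same size. The function names, the toy coordinates, and the parameters n_perm and seed are all hypothetical.

```python
import itertools
import math
import random


def mean_pairwise_distance(coords):
    """Mean Euclidean distance over all pairs of 3D points."""
    pairs = list(itertools.combinations(coords, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def clustering_p_value(residue_coords, substituted, n_perm=2000, seed=0):
    """Permutation p-value for spatial clustering of substituted residues.

    ASSUMPTION: a generic test, not the P3D statistic from the paper.
    residue_coords: dict mapping residue index -> (x, y, z)
    substituted: residue indices substituted on one branch (at least 2)
    """
    rng = random.Random(seed)
    residues = list(residue_coords)
    observed = mean_pairwise_distance([residue_coords[r] for r in substituted])
    hits = 0
    for _ in range(n_perm):
        # Draw a random residue set of the same size as the observed one.
        sample = rng.sample(residues, len(substituted))
        if mean_pairwise_distance([residue_coords[r] for r in sample]) <= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy "structure" (hypothetical data): 60 residues spaced 3.8 angstroms apart on a line.
coords = {i: (3.8 * i, 0.0, 0.0) for i in range(60)}
p_clustered = clustering_p_value(coords, [10, 11, 12, 13])
p_spread = clustering_p_value(coords, [0, 20, 40, 59])
print(p_clustered, p_spread)
```

In this toy case the four adjacent residues should give a small p-value and the four widely spaced residues a large one. A real analysis would use residue coordinates from a solved or modelled structure and substitutions inferred for each branch by ancestral sequence reconstruction.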

Availability and implementation: Source code and predictions for analyses performed in this study are available at:


Supplementary information: Supplementary data are available at Bioinformatics online.

MeSH terms

  • Animals
  • Computational Biology / methods*
  • Evolution, Molecular*
  • Mutation*
  • Phylogeny*
  • Plants / genetics
  • Plants / metabolism
  • Protein Conformation
  • Proteins / genetics*
  • Proteins / metabolism
  • Proteins / physiology
  • Software*
  • Vertebrates / genetics
  • Vertebrates / metabolism


Substances

  • Proteins