Improving functional annotation of non-synonymous SNPs with information theory

Pac Symp Biocomput. 2005;397-408. doi: 10.1142/9789812702456_0038.


Automated functional annotation of nsSNPs requires that amino-acid residue changes be represented by a set of descriptive features, such as evolutionary conservation, side-chain volume change, effect on ligand binding, and residue structural rigidity. Identifying the most informative combinations of features is critical to the success of a computational prediction method. We rank 32 features according to their mutual information with functional effects of amino-acid substitutions, as measured by in vivo assays. In addition, we use a greedy algorithm to identify a subset of highly informative features. The method is simple to implement and provides a quantitative measure for selecting the best predictive features given a set of features that a human expert believes to be informative. We demonstrate the usefulness of the selected highly informative features by cross-validated tests of a computational classifier, a support vector machine (SVM). The SVM's classification accuracy is highly correlated with the ranking of the input features by their mutual information. Two features describing the solvent accessibility of "wild-type" and "mutant" amino-acid residues and one evolutionary feature based on superfamily-level multiple alignments produce comparable overall accuracy and 6% fewer false positives than a 32-feature set that considers physicochemical properties of amino acids, protein electrostatics, amino-acid residue flexibility, and binding interactions.
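
The ranking and greedy selection described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how features were discretized, how mutual information was estimated, or which criterion the greedy step optimized, so everything below is an assumption: features are taken to be already discretized, mutual information is a plug-in estimate from observed counts, and the greedy step adds whichever feature most increases the joint mutual information of the selected set with the label. The function names and the toy feature names are hypothetical and do not come from the paper.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete observations."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    # Sum over observed (x, y) pairs of p(x,y) * log2(p(x,y) / (p(x) p(y))).
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def rank_features(features, labels):
    """Return (name, MI) pairs sorted by mutual information, highest first."""
    scores = {name: mutual_information(col, labels)
              for name, col in features.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def greedy_select(features, labels, k):
    """Greedy forward selection (assumed criterion): at each step add the
    feature that maximizes the joint MI of the selected set with the labels."""
    selected = []
    remaining = set(features)

    def joint_mi(candidate):
        cols = [features[f] for f in selected + [candidate]]
        # Treat each row's tuple of feature values as one joint symbol.
        return mutual_information(list(zip(*cols)), labels)

    while remaining and len(selected) < k:
        # Sorting makes tie-breaking deterministic (alphabetical).
        best = max(sorted(remaining), key=joint_mi)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    # Toy data only: 1 = substitution affects function, 0 = neutral.
    labels = [0, 0, 1, 1, 0, 1, 0, 1]
    features = {
        "buried_wild_type": [0, 0, 1, 1, 0, 1, 0, 0],   # hypothetical
        "conserved_column": [0, 1, 1, 1, 0, 1, 0, 1],   # hypothetical
        "constant_feature": [0, 0, 0, 0, 0, 0, 0, 0],   # carries no information
    }
    for name, score in rank_features(features, labels):
        print(f"{name}: {score:.3f} bits")
    print(greedy_select(features, labels, k=2))
```

Joint estimates of this kind become unreliable as the selected set grows, because the number of distinct value combinations quickly approaches the number of samples. A practical implementation would likely need coarser binning or a bias-corrected estimator.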

Publication types

  • Research Support, U.S. Gov't, P.H.S.

MeSH terms

  • Analysis of Variance
  • Bacteriophage T4 / genetics
  • Base Sequence
  • Biological Evolution*
  • Databases, Nucleic Acid
  • Markov Chains
  • Models, Genetic
  • Mutation
  • Polymorphism, Single Nucleotide / genetics*