Enumerating and ranking discrete motifs

Proc Int Conf Intell Syst Mol Biol. 1997;5:202-9.

Abstract

Discrete motifs that discriminate functional classes of proteins are useful for classifying new sequences, capturing structural constraints, and identifying protein subclasses. Although the space of such motifs can grow exponentially with sequence length and with the number of sequences, we show that in practice it usually does not, and we describe a technique that infers motifs from aligned protein sequences by exhaustively searching this space. Our method generates sequence motifs over a wide range of recall and precision, and chooses a representative motif based on a score that we derive from both statistical and information-theoretic frameworks. Finally, we show that the selected motifs perform well in practice: they classify unseen sequences with extremely high precision and identify protein subclasses that correspond to known biochemical classes.
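The idea of enumerating discrete motifs over aligned sequences and ranking them by recall and precision can be illustrated with a toy sketch. This is not the authors' algorithm. The motif representation (a mapping from alignment column to a set of allowed residues), the function names, the `max_columns` limit, the example sequences, and the use of the F1 score as a stand-in for the paper's statistical and information-theoretic score are all assumptions made for illustration.

```python
from itertools import combinations, product

def matches(motif, seq):
    """Return True if seq has an allowed residue at every motif column."""
    return all(seq[col] in allowed for col, allowed in motif.items())

def precision_recall(motif, positives, negatives):
    """Precision and recall of a motif against labeled aligned sequences."""
    tp = sum(matches(motif, s) for s in positives)
    fp = sum(matches(motif, s) for s in negatives)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / len(positives) if positives else 0.0
    return precision, recall

def nonempty_subsets(residues):
    """All non-empty subsets of the residues seen in one column."""
    items = sorted(residues)
    return [frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

def enumerate_motifs(positives, max_columns=2):
    """Exhaustively yield motifs that constrain up to max_columns columns.

    Only residues observed in the positive class are considered, which
    keeps the search space small on real alignments with conserved columns.
    """
    length = len(positives[0])
    choices = [nonempty_subsets({s[c] for s in positives})
               for c in range(length)]
    for k in range(1, max_columns + 1):
        for cols in combinations(range(length), k):
            for sets in product(*(choices[c] for c in cols)):
                yield dict(zip(cols, sets))

def f1(motif, positives, negatives):
    """Placeholder ranking score (assumption): harmonic mean of P and R."""
    p, r = precision_recall(motif, positives, negatives)
    return 2 * p * r / (p + r) if p + r else 0.0

# Hypothetical aligned sequences: one functional class and a background set.
positives = ["GKSTA", "GKTTA", "GKSSA"]
negatives = ["AKSTG", "GRTSA"]

ranked = sorted(enumerate_motifs(positives),
                key=lambda m: f1(m, positives, negatives),
                reverse=True)
best = ranked[0]
print(best, precision_recall(best, positives, negatives))
```

Restricting each column to residues observed in the positive class is one simple way to keep exhaustive enumeration tractable in a toy setting. A different score can be swapped in by replacing `f1`.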

Publication types

  • Comparative Study
  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, P.H.S.

MeSH terms

  • Algorithms*
  • Amino Acid Sequence
  • Amino Acids / chemistry
  • Artificial Intelligence
  • Databases, Factual
  • Molecular Sequence Data
  • Protein Conformation*
  • Proteins / chemistry
  • Proteins / classification
  • Proteins / genetics
  • Sequence Alignment
  • Software
  • Tubulin / chemistry
  • Tubulin / genetics

Substances

  • Amino Acids
  • Proteins
  • Tubulin