Detecting homology of distantly related proteins with consensus sequences

J Mol Biol. 1987 Dec 20;198(4):567-77. doi: 10.1016/0022-2836(87)90200-2.

Abstract

A simple protocol is described for detecting distantly related members of a protein family. In this procedure, similarity to a consensus sequence is used to distinguish chance similarity from similarity due to common ancestry. The consensus sequence is constructed from the sequences of established members of a protein family and incorporates features characteristic of the protein fold of this family: conserved residues, the pattern of variable and conserved segments, preferred locations of gaps, etc. The database is searched with the consensus sequence, using the unitary matrix or a log-odds matrix for scoring the alignments, with a variable gap penalty. The advantages of the method are that it weights key residues, ignores sequence similarity in variable segments (thus partially eliminating "background noise" coming from chance similarity), and distinguishes gaps that disrupt conserved segments from gaps in positions known to tolerate them. The utility of the method was demonstrated for two protein families: those homologous with the internal repeats of complement B and with the internal repeats identified in fibroblast proteoglycan PG40. The consensus sequence method succeeded in finding some new members of these protein families that could not be detected by earlier methods of sequence comparison.
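
The procedure summarized above can be illustrated with a minimal sketch. It is not the published implementation: the function names, the 80% conservation threshold, the linear (non-affine) gap model, and the two penalty values are all assumptions chosen for illustration. The sketch builds a consensus from pre-aligned family members, marks variable columns as "x", and scores a candidate with unitary (identity) scoring on conserved columns only, charging more for gaps at conserved columns than at variable ones.

```python
from collections import Counter


def build_consensus(aligned, min_fraction=0.8):
    """Build a consensus string from equal-length, pre-aligned sequences.

    A column is kept as a residue when one non-gap residue reaches
    min_fraction (an assumed threshold); otherwise it becomes 'x',
    marking a variable position.
    """
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")
    columns = []
    for col in zip(*aligned):
        residue, count = Counter(col).most_common(1)[0]
        conserved = residue != "-" and count / len(col) >= min_fraction
        columns.append(residue if conserved else "x")
    return "".join(columns)


def consensus_score(consensus, seq, gap_conserved=2.0, gap_variable=0.25):
    """Score seq against the whole consensus; unaligned ends of seq are free.

    Unitary scoring: +1 for an identity at a conserved column, 0 otherwise.
    Variable ('x') columns never contribute, so chance identities there
    are ignored. Gap penalties (assumed values) depend on the consensus
    column they are attached to.
    """
    def gap(i):
        return gap_variable if consensus[i] == "x" else gap_conserved

    n = len(seq)
    prev = [0.0] * (n + 1)  # row 0: leading residues of seq cost nothing
    for i in range(1, len(consensus) + 1):
        c = consensus[i - 1]
        cur = [prev[0] - gap(i - 1)]  # column i missing before any residue
        for j in range(1, n + 1):
            match = 1.0 if (c != "x" and c == seq[j - 1]) else 0.0
            cur.append(max(
                prev[j - 1] + match,      # align column i with residue j
                prev[j] - gap(i - 1),     # seq lacks column i
                cur[j - 1] - gap(i - 1),  # extra residue after column i
            ))
        prev = cur
    return max(prev)  # trailing residues of seq cost nothing


# Toy family (invented sequences, for illustration only).
family = ["CPAGW", "CPSGW", "CPTGW", "CPEGW"]
consensus = build_consensus(family)
```

With this toy family the third column is variable, so a candidate that lacks a residue there loses only the small variable-gap penalty, whereas one that lacks a conserved residue is penalized heavily. Replacing the identity score with a log-odds matrix lookup would be the natural variation mentioned in the abstract.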

Publication types

  • Comparative Study
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Amino Acid Sequence
  • Complement Factor B*
  • Enzyme Precursors*
  • Glycoproteins
  • Molecular Sequence Data
  • Proteins*
  • Proteoglycans
  • beta 2-Glycoprotein I

Substances

  • Enzyme Precursors
  • Glycoproteins
  • LRG1 protein, human
  • Proteins
  • Proteoglycans
  • beta 2-Glycoprotein I
  • Complement Factor B