Relations of the numbers of protein sequences, families and folds

Protein Eng. 1997 Jul;10(7):757-61. doi: 10.1093/protein/10.7.757.


The relations among the numbers of protein sequences, families and folds have been studied theoretically. The number of families is found to be related to the natural logarithm of the number of sequences, and this logarithmic relation should hold regardless of the homology threshold applied in the protein sequence comparison routines. To study the relation between the numbers of families and folds, the degenerate degree of a fold is introduced, defined as the number of protein families that adopt the same fold. The distribution of the degenerate degrees of folds is found to be very likely exponential. From this distribution the average degenerate degree, d, is calculated; the number of folds is then simply the number of families divided by d. It is shown that d is an increasing function of time. Its current value is about 2, and it is expected to keep increasing and to reach at least 3.3 in some years. Using this result, the numbers of protein folds for four species have been estimated. In particular, the number of folds for human proteins is estimated to be ≤5200.
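The three relations stated in the abstract can be illustrated with a minimal Python sketch. The abstract gives no fitted constants or the exact form of the distribution, so everything beyond the relations themselves is an assumption made for illustration: the coefficients `a` and `b`, the decay rate `lam`, the function names, the family count of 1000, and the choice of a discrete exponential P(k) proportional to exp(-lam * k) for k = 1, 2, ... are all hypothetical and are not taken from the paper.

```python
import math


def families_from_sequences(n_sequences, a, b):
    """Logarithmic relation: families = a * ln(sequences) + b.

    a and b are hypothetical fit constants, not values from the paper.
    """
    return a * math.log(n_sequences) + b


def mean_degenerate_degree(lam):
    """Mean of an assumed discrete exponential distribution.

    Assumes P(k) is proportional to exp(-lam * k) for k = 1, 2, ...,
    which is a geometric distribution with mean 1 / (1 - exp(-lam)).
    """
    return 1.0 / (1.0 - math.exp(-lam))


def folds_from_families(n_families, d):
    """Number of folds = number of families / average degenerate degree."""
    return n_families / d


if __name__ == "__main__":
    # Under the assumed distribution, lam = ln 2 gives d = 2,
    # the value the abstract quotes as current.
    d_now = mean_degenerate_degree(math.log(2))
    # Hypothetical family count, evaluated at d = 2 and at d = 3.3.
    for d in (d_now, 3.3):
        print(d, folds_from_families(1000, d))
```

One consequence of the logarithmic form is visible in the sketch: doubling the number of sequences adds a fixed `a * ln 2` families, whatever the starting size. For a fixed number of families, a larger d gives a smaller fold count.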

Publication types

  • Comparative Study
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Animals
  • Bacterial Proteins / chemistry
  • Caenorhabditis elegans / chemistry
  • Escherichia coli / chemistry
  • Fungal Proteins / chemistry
  • Helminth Proteins / chemistry
  • Humans
  • Models, Chemical
  • Protein Conformation
  • Protein Engineering
  • Protein Folding
  • Proteins / chemistry*
  • Proteins / classification
  • Saccharomyces cerevisiae / chemistry
  • Species Specificity

Substances

  • Bacterial Proteins
  • Fungal Proteins
  • Helminth Proteins
  • Proteins