Clustering of highly homologous sequences to reduce the size of large protein databases

Bioinformatics. 2001 Mar;17(3):282-3. doi: 10.1093/bioinformatics/17.3.282.


We present a fast and flexible program for clustering large protein databases at different sequence identity levels. It takes less than 2 h for the all-against-all sequence comparison and clustering of the non-redundant protein database of over 560,000 sequences on a high-end PC. The output database, including only the representative sequences, can be used for more efficient and sensitive database searches.
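
The abstract does not spell out how clusters and their representatives are chosen. As a rough illustration only, the sketch below assumes a greedy scheme: sequences are processed from longest to shortest, and each one either joins the first existing representative it matches at or above the identity threshold or becomes a new representative. The names `identity` and `cluster`, the 0.9 default threshold, and the `difflib`-based identity measure are all assumptions made for this example; they are a crude stand-in for a real sequence alignment and are not taken from the program described.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Crude stand-in for alignment identity (assumption, not the paper's measure):
    residues in common matching blocks divided by the shorter sequence length."""
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def cluster(sequences: Dict[str, str], threshold: float = 0.9) -> Dict[str, List[str]]:
    """Greedy clustering sketch: map each representative name to its member names."""
    clusters: Dict[str, List[str]] = {}
    # Longest first, so each representative is the longest member of its cluster.
    for name in sorted(sequences, key=lambda n: (-len(sequences[n]), n)):
        seq = sequences[name]
        for rep in clusters:
            if identity(sequences[rep], seq) >= threshold:
                clusters[rep].append(name)  # redundant with an existing representative
                break
        else:
            clusters[name] = [name]  # no match: this sequence starts a new cluster
    return clusters


# Hypothetical toy input: "b" is "a" with its last residue removed, "c" is unrelated.
toy = {
    "a": "MKTAYIAKQRQISFVKSHFSRQ",
    "b": "MKTAYIAKQRQISFVKSHFSR",
    "c": "GSHMLEDPVAGWWQ",
}
clusters = cluster(toy, threshold=0.9)
representatives = sorted(clusters)  # the reduced, representative-only set
```

In this toy run the reduced database would keep only the keys of `clusters`. Comparing every sequence against every representative this way scales poorly, so a program that handles hundreds of thousands of sequences in under 2 h must rely on much faster filtering than this sketch shows.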

Publication types

  • Research Support, U.S. Gov't, P.H.S.

MeSH terms

  • Algorithms
  • Databases, Factual*
  • Proteins / analysis*
  • Sequence Analysis
  • Software*

Substances

  • Proteins