Gclust: trans-kingdom classification of proteins using automatic individual threshold setting

Bioinformatics. 2009 Mar 1;25(5):599-605. doi: 10.1093/bioinformatics/btp047. Epub 2009 Jan 21.

Abstract

Motivation: Trans-kingdom protein clustering has remained difficult because of the large sequence divergence between eukaryotes and prokaryotes and the presence of transit sequences in organellar proteins. Large-scale protein clustering that includes such divergent organisms needs a heuristic that efficiently selects similar proteins by setting a proper homolog threshold for each protein. Here, a method is described that uses two similarity measures and an organism count.

Results: The Gclust software constructs minimal homolog groups from all-against-all BLASTP results by single-linkage clustering. Major points include (i) estimation of the domain structure of proteins; (ii) exclusion of multi-domain proteins; (iii) explicit consideration of transit peptides; and (iv) heuristic estimation of a similarity threshold for homologs of each protein by an entropy-optimized organism count method. The resultant clusters were evaluated in light of the power law. The software was used to construct protein clusters for up to 95 organisms.
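
As an illustration only, single-linkage clustering over pairwise hits with a separate threshold per protein can be sketched as below. This is not the Gclust implementation: the function name `single_linkage`, the `(query, subject, score)` tuple layout, the rule that a hit must pass the thresholds of both proteins, and all example values are assumptions made for this sketch. The abstract does not say how the per-protein thresholds are applied to a pair, and the entropy-based estimation of those thresholds is not reproduced here.

```python
from collections import defaultdict


def single_linkage(hits, thresholds, default_threshold=0.0):
    """Group proteins by single linkage using a union-find structure.

    hits: iterable of (query, subject, score) tuples (hypothetical layout).
    thresholds: dict mapping a protein to its own minimum score.
    A hit links two proteins only if the score meets the thresholds of
    BOTH proteins (an assumption of this sketch, not taken from the paper).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for query, subject, score in hits:
        # Register both proteins so unlinked ones still form singletons.
        find(query)
        find(subject)
        t_query = thresholds.get(query, default_threshold)
        t_subject = thresholds.get(subject, default_threshold)
        if score >= t_query and score >= t_subject:
            union(query, subject)

    clusters = defaultdict(set)
    for protein in list(parent):
        clusters[find(protein)].add(protein)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Made-up example data: scores and thresholds are illustrative only.
hits = [("A", "B", 200), ("B", "C", 150), ("C", "D", 40), ("E", "E", 300)]
thresholds = {"A": 100, "B": 100, "C": 100, "D": 30}
clusters = single_linkage(hits, thresholds)
print(clusters)  # [['A', 'B', 'C'], ['D'], ['E']]
```

In this sketch the weak C-D hit passes D's threshold but not C's, so D stays in its own group. Setting a threshold per protein in this way is what keeps one weak hit from merging two otherwise unrelated groups, which single linkage is prone to do.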

Availability: Software and data are available at http://gclust.c.u-tokyo.ac.jp/Gclust_Download.html.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Cluster Analysis
  • Computational Biology / methods*
  • Internet
  • Proteins / chemistry
  • Proteins / classification*
  • Sequence Analysis, Protein / methods
  • Software*

Substances

  • Proteins