YAKUSA: a fast structural database scanning method

Proteins. 2005 Oct 1;61(1):137-51. doi: 10.1002/prot.20517.


YAKUSA is a program designed for rapid scanning of a structural database with a query protein structure. It searches for the longest common substructures called SHSPs (structural high-scoring pairs) existing between a query structure and every structure in the structural database. It makes use of protein backbone internal coordinates (alpha angles) in order to describe protein structures as sequences of symbols. The structural similarities are established in 5 steps, the first 3 being analogous to those used in BLAST: (1) building up a deterministic finite automaton describing all patterns identical or similar to those in the query structure; (2) searching for all these patterns in every structure in the database; (3) extending the patterns to longer matching substructures (i.e., SHSPs); (4) selecting compatible SHSPs for each query-database structure pair; and (5) ranking the query-database structure pairs using 3 scores based on SHSP similarity, on SHSP probabilities, and on spatial compatibility of SHSPs. Structural fragment probabilities are estimated according to a mixture transition distribution model, which is an approximation of a high-order Markov chain model. With regard to sensitivity and selectivity of the structural matches, YAKUSA compares well to the best related programs, although it is by far faster: A typical database scan takes about 40 s CPU time on a desktop personal computer. It has also been implemented on a Web server for real-time searches.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Computational Biology / methods*
  • Databases, Protein*
  • Internet
  • Markov Chains
  • Models, Molecular
  • Protein Folding
  • Protein Structure, Quaternary
  • Protein Structure, Tertiary
  • Proteins / chemistry*
  • Proteins / classification
  • Proteins / metabolism
  • Sensitivity and Specificity
  • Software*
  • Structural Homology, Protein
  • Time Factors


  • Proteins