Performance of similarity measures in 2D fragment-based similarity searching: comparison of structural descriptors and similarity coefficients

J Chem Inf Comput Sci. Nov-Dec 2002;42(6):1407-14. doi: 10.1021/ci025531g.


2D fragment-based similarity searching is one of the most popular techniques for searching large databases of chemical structures and has been widely applied in drug discovery. However, its performance, especially its effectiveness in retrieving active structural analogues, has not been adequately studied. We report a series of computational experiments in which we systematically studied the influence of structural descriptors and similarity coefficients on the effectiveness of similarity searching. The study was conducted using two large public data sets, NCI anti-AIDS and MDDR. Four sets of 2D linear fragment descriptors, based on the original definitions of atom pairs and atom sequences, were compared. The effect of using the Tanimoto coefficient versus the Euclidean distance was studied as a function of descriptor set. The results clearly indicate that, in terms of hit rate, the Tanimoto coefficient is superior to the Euclidean distance in 2D fragment-based similarity searching, while atom sequences demonstrate the best overall performance among the structural descriptors studied.
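The two similarity measures compared in the abstract can be illustrated with a minimal sketch. It assumes binary fingerprints stored as sets of fragment labels. The labels, molecule names, and helper functions below are hypothetical and are not the atom-pair or atom-sequence descriptors used in the study. For binary fingerprints, the Tanimoto coefficient is the number of shared fragments divided by the number of distinct fragments in either structure, and the Euclidean distance reduces to the square root of the number of fragments present in only one of the two structures.

```python
from math import sqrt


def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient of two binary fingerprints (higher = more similar)."""
    if not a and not b:
        return 1.0  # assumed convention for two empty fingerprints
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def euclidean(a: set, b: set) -> float:
    """Euclidean distance of two binary fingerprints (lower = more similar).

    For 0/1 vectors the squared distance equals the number of differing bits,
    i.e. the size of the symmetric difference of the two fragment sets.
    """
    return sqrt(len(a ^ b))


def rank_by(query: set, database: dict, score, descending: bool) -> list:
    """Return database keys ordered from most to least similar to the query."""
    return sorted(database, key=lambda name: score(query, database[name]),
                  reverse=descending)


# Hypothetical fragment labels, for illustration only.
query = {"C-C", "C-N", "C-O", "N-O"}
database = {
    "mol_a": {"C-C", "C-N", "C-O"},
    "mol_b": {"C-C", "C-N", "C-O", "N-O", "C-S", "S-O", "C-Cl", "C-F"},
    "mol_c": {"C-C"},
}

tanimoto_ranking = rank_by(query, database, tanimoto, descending=True)
euclidean_ranking = rank_by(query, database, euclidean, descending=False)

print(tanimoto_ranking)   # ['mol_a', 'mol_b', 'mol_c']
print(euclidean_ranking)  # ['mol_a', 'mol_c', 'mol_b']
```

In this toy case the two measures order the candidates differently: mol_b contains every query fragment plus four others, while mol_c shares only one fragment with the query but has few fragments overall, so the Euclidean distance places mol_c ahead of mol_b. This only shows that the two measures can disagree; it is not a reproduction of the reported experiments or an explanation of their outcome.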

MeSH terms

  • Anti-HIV Agents / chemistry*
  • Computer Simulation
  • Databases, Factual*
  • Drug Evaluation, Preclinical / methods*
  • Information Storage and Retrieval
  • Models, Chemical
  • Molecular Structure
  • National Institutes of Health (U.S.)
  • United States