Normalized Compression Distance of Multisets with Applications

IEEE Trans Pattern Anal Mach Intell. 2015 Aug;37(8):1602-14. doi: 10.1109/TPAMI.2014.2375175.

Abstract

The pairwise normalized compression distance (NCD) is a parameter-free, feature-free, alignment-free similarity metric based on compression. We propose an NCD of multisets that is also a metric. Previous attempts to obtain such an NCD failed. For classification purposes, it is superior to the pairwise NCD in accuracy and implementation complexity. We cover the entire trajectory from theoretical underpinning to feasible practice. The multiset NCD is applied to biological (stem cell, organelle transport) and OCR classification questions that were earlier treated with the pairwise NCD. With the new method we achieved significantly better results. The theoretical foundation is Kolmogorov complexity.
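
For orientation, the pairwise NCD that the abstract takes as its baseline is commonly written as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C gives the compressed length. The sketch below illustrates only that pairwise form, not the multiset NCD proposed in the paper. The choice of zlib as the compressor and the helper names compressed_size and ncd are assumptions made for illustration; the abstract does not say which compressor the authors used.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (an assumed stand-in for C)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Pairwise normalized compression distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation xy
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Values near 0 suggest that the two inputs share most of their structure, and values near 1 suggest that they share little. With a real compressor the result only approximates the ideal distance and can fall slightly outside the interval from 0 to 1.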

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Animals
  • Data Compression / methods*
  • Data Mining / methods*
  • Databases, Factual
  • Handwriting
  • Humans
  • Mice
  • Models, Theoretical
  • Organelles / metabolism
  • Pattern Recognition, Automated / methods*
  • Retina / cytology
  • Stem Cells / cytology