Clustering by Compression

Created by W.Langdon from gp-bibliography.bib Revision:1.4549

  author =       "Rudi Cilibrasi and Paul M. B. Vitanyi",
  title =        "Clustering by Compression",
  journal =      "IEEE Transactions on Information Theory",
  year =         "2005",
  volume =       "51",
  number =       "4",
  pages =        "1523--1545",
  month =        apr,
  keywords =     "genetic algorithms, genetic programming, complearn,
                 universal dissimilarity distance, normalised
                 compression distance, hierarchical unsupervised
                 clustering, quartet tree method, parameter-free
                 data-mining, heterogenous data analysis, Kolmogorov
  ISSN =         "0018-9448",
  URL =          "",
  DOI =          "doi:10.1109/TIT.2005.844059",
  size =         "21 pages",
  abstract =     "We present a new method for clustering based on
                 compression. The method doesn't use subject-specific
                 features or background knowledge, and works as follows:
                 First, we determine a parameter-free, universal,
                 similarity distance, the normalized compression
                 distance or NCD , computed from the lengths of
                 compressed data files (singly and in pairwise
                 concatenation). Second, we apply a hierarchical
                 clustering method. The NCD is not restricted to a
                 specific application area, and works across application
                 area boundaries. A theoretical precursor, the
                 normalised information distance, co-developed by one of
                 the authors, is provably optimal. However, the
                 optimality comes at the price of using the
                 non-computable notion of Kolmogorov complexity. We
                 propose axioms to capture the real-world setting, and
                 show that the NCD approximates optimality. To extract a
                 hierarchy of clusters from the distance matrix, we
                 determine a dendrogram (binary tree) by a new quartet
                 method and a fast heuristic to implement it. The method
                 is implemented and available as public software, and is
                 robust under choice of different compressors. To
                 substantiate our claims of universality and robustness,
                 we report evidence of successful application in areas
                 as diverse as genomics, virology, languages,
                 literature, music, handwritten digits, astronomy, and
                 combinations of objects from completely different
                 domains, using statistical, dictionary, and block
                 sorting compressors. In genomics we presented new
                 evidence for major questions in Mammalian evolution,
                 based on whole-mitochondrial genomic analysis: the
                 Eutherian orders and the Marsupionta hypothesis against
                 the Theria hypothesis.",
  notes =        "With respect to the version published in the IEEE
                 Trans. Inform. Th., 51:4(2005), 1523?1545, we have
                 changed Definition 2.1 of 'admissible distance' making
                 it more general and Definitions 2.4 and 2.5 of
                 'normalized admissible distance,' consequently adapted
                 Lemma 2.6 (II.2) and in its proof (II.3) and the
                 displayed inequalities. This left Theorem 6.3 unchanged
                 except for changing 'such that d(x; y) \le e' to 'such
                 that d(x; y) \le e and C(v) \le C(x).'",

Genetic Programming entries for Rudi Cilibrasi Paul M B Vitanyi