The Google Similarity Distance

Created by W.Langdon from gp-bibliography.bib Revision:1.4504

  author =       "Rudi L. Cilibrasi and Paul M. B. Vitanyi",
  title =        "The Google Similarity Distance",
  journal =      "IEEE Transactions on Knowledge and Data Engineering",
  year =         "2007",
  volume =       "19",
  number =       "3",
  pages =        "370--383",
  month =        mar,
  keywords =     "genetic algorithms, genetic programming",
  ISSN =         "1041-4347",
  DOI =          "doi:10.1109/TKDE.2007.48",
  abstract =     "Words and phrases acquire meaning from the way they
                 are used in society, from their relative semantics to
                 other words and phrases. For computers, the equivalent
                 of {"}society{"} is {"}database,{"} and the equivalent
                 of {"}use{"} is {"}a way to search the database{"}. We
                 present a new theory of similarity between words and
                 phrases based on information distance and Kolmogorov
                 complexity. To fix thoughts, we use the World Wide Web
                 (WWW) as the database, and Google as the search engine.
                 The method is also applicable to other search engines
                 and databases. This theory is then applied to construct
                 a method to automatically extract similarity, the
                 Google similarity distance, of words and phrases from
                 the WWW using Google page counts. The WWW is the
                 largest database on earth, and the context information
                 entered by millions of independent users averages out
                 to provide automatic semantics of useful quality. We
                 give applications in hierarchical clustering,
                 classification, and language translation. We give
                 examples to distinguish between colours and numbers,
                 cluster names of paintings by 17th century Dutch
                 masters and names of books by English novelists, the
                 ability to understand emergencies and primes, and we
                 demonstrate the ability to do a simple automatic
                 English-Spanish translation. Finally, we use the
                 WordNet database as an objective baseline against which
                 to judge the performance of our method. We conduct a
                 massive randomized trial in binary classification using
                 support vector machines to learn categories based on
                 our Google distance, resulting in a mean agreement
                 of 87 percent with the expert crafted WordNet
                 categories.",
  notes =        "Also known as \cite{4072748}",
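
The abstract describes deriving a similarity distance from search-engine page counts. As a rough illustration only, the sketch below computes the normalized Google distance in the form usually quoted from the paper, NGD(x,y) = (max(log f(x), log f(y)) - log f(x,y)) / (log N - min(log f(x), log f(y))). The function name, the parameter names, and the page counts in the usage note are hypothetical; no real search engine is queried, and the index size N is an assumed constant.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """Sketch of the normalized Google distance from page counts.

    fx, fy -- number of pages containing term x, term y (hypothetical inputs)
    fxy    -- number of pages containing both terms
    n      -- assumed total number of indexed pages (normalizing constant)
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        # Terms that never occur, or never co-occur, are treated as
        # infinitely far apart in this sketch.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))
```

With made-up counts, normalized_google_distance(100, 100, 10, 10000) evaluates to about 0.5, while two terms that always appear together (fx = fy = fxy) give 0. Smaller values indicate terms that co-occur more often than their individual frequencies would suggest.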
