A Genetic Programming Approach to Record Deduplication

Created by W.Langdon from gp-bibliography.bib Revision:1.4448

  author =       "Moises G. {de Carvalho} and Alberto H. F. Laender and 
                 Marcos Andre Goncalves and Altigran S. {da Silva}",
  title =        "A Genetic Programming Approach to Record
                 Deduplication",
  journal =      "IEEE Transactions on Knowledge and Data Engineering",
  year =         "2012",
  month =        mar,
  volume =       "24",
  number =       "3",
  pages =        "399--412",
  abstract =     "Several systems that rely on consistent data to offer
                 high quality services, such as digital libraries and
                 e-commerce brokers, may be affected by the existence of
                 duplicates, quasi-replicas, or near-duplicate entries
                 in their repositories. Because of that, there have been
                 significant investments from private and government
                 organisations in developing methods for removing
                 replicas from their data repositories. This is due to the
                 fact that clean and replica-free repositories not only
                 allow the retrieval of higher-quality information but
                 also lead to more concise data and to potential savings
                 in computational time and resources to process this
                 data. In this article, we propose a genetic programming
                 approach to record deduplication that combines several
                 different pieces of evidence extracted from the data
                 content to find a deduplication function that is able
                 to identify whether two entries in a repository are
                 replicas or not. As shown by our experiments, our
                 approach outperforms an existing state-of-the-art
                 method found in the literature. Moreover, the suggested
                 functions are computationally less demanding since they
                 use less evidence. In addition, our genetic
                 programming approach is capable of automatically
                 adapting these functions to a given fixed replica
                 identification boundary, freeing the user from the
                 burden of having to choose and tune this parameter.",
  keywords =     "genetic algorithms, genetic programming, computational
                 time, data repositories, database administration,
                 database integration, digital libraries, e-commerce
                 brokers, fixed replica identification boundary,
                 information retrieval, record deduplication, replica
                 removal, replica-free repositories, replicated
                 databases",
  size =         "14 pages",
  DOI =          "10.1109/TKDE.2010.234",
  ISSN =         "1041-4347",
  notes =        "Also known as \cite{5645623}",
