Record Deduplication By Evolutionary Means

Created by W.Langdon from gp-bibliography.bib Revision:1.4448

  author =       "Marco Modesto and Moises G. {de Carvalho} and 
                 Walter {dos Santos}",
  title =        "Record Deduplication By Evolutionary Means",
  howpublished = "CiteSeerX",
  year =         "2002?",
  address =      "Departamento de Ciencia da Computacao, Universidade
                 Federal de Minas Gerais, Belo Horizonte, MG, Brazil",
  keywords =     "genetic algorithms, genetic programming",
  annote =       "The Pennsylvania State University CiteSeerX Archives",
  bibsource =    "OAI-PMH server at",
  language =     "en",
  oai =          "oai:CiteSeerX.psu:",
  rights =       "Metadata may be used without restrictions as long as
                 the oai identifier remains attached to it.",
  URL =          "",
  size =         "5 pages",
  abstract =     "Identifying record replicas in digital data
                 repositories is a key step in improving the quality of
                 content and services available, as well as in enabling
                 eventual sharing efforts. Several deduplication
                 strategies are available, but most of them rely on
                 manually chosen settings to combine evidence used to
                 identify records as being replicas. In this work, we
                 present the results of experiments we have carried out
                 with a Machine Learning approach for the deduplication
                 problem. Our approach is based on Genetic Programming
                 (GP), which is able to automatically generate similarity
                 functions to identify record replicas in a given
                 repository. The generated similarity functions properly
                 combine and weight the best evidence available among
                 the record fields in order to tell when two distinct
                 records represent the same real-world entity. In
                 previous work, fixed similarity functions were
                 associated with each piece of evidence. In the present
                 work, GP is also used to choose the best associations
                 between evidence and similarity functions. The results of the
                 experiments show that our approach outperforms the
                 baseline method by Fellegi and Sunter. It also
                 outperforms the previous GP results, which used fixed
                 evidence associations, when identifying replicas in a
                 data set containing researchers' personal data.",
  notes =        "Oct 2017 reference by

Genetic Programming entries for Marco Modesto Moises G de Carvalho Walter dos Santos
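
The abstract describes similarity functions that combine and weight evidence, where each piece of evidence pairs a record field with a similarity function. As a rough illustration only, the sketch below shows how one such candidate function could be represented as an expression tree and evaluated on a pair of records. The tree encoding, the field names, the three string similarity measures, the operators and the threshold are all assumptions made for this example; the record above does not say which ones the authors used. The evolutionary search itself (selection, crossover and mutation against labelled pairs) is not shown.

```python
from difflib import SequenceMatcher

# Hypothetical similarity measures; each returns a score in [0, 1].
def exact(a, b):
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0

def ratio(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def jaccard(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

SIMILARITY = {"exact": exact, "ratio": ratio, "jaccard": jaccard}

def evaluate(tree, rec_a, rec_b):
    """Evaluate an expression tree on two records.

    Node shapes (assumed for this sketch):
      ("ev", field, sim_name)  evidence leaf: a field paired with a measure
      ("const", value)         numeric constant, usable as a weight
      ("+", left, right)       sum of two subtrees
      ("*", left, right)       product of two subtrees
    """
    op = tree[0]
    if op == "ev":
        _, field, sim_name = tree
        return SIMILARITY[sim_name](rec_a[field], rec_b[field])
    if op == "const":
        return tree[1]
    left = evaluate(tree[1], rec_a, rec_b)
    right = evaluate(tree[2], rec_a, rec_b)
    if op == "+":
        return left + right
    if op == "*":
        return left * right
    raise ValueError(f"unknown node type: {op!r}")

def is_replica(tree, rec_a, rec_b, threshold):
    """Flag two records as replicas when the tree's score reaches the threshold."""
    return evaluate(tree, rec_a, rec_b) >= threshold

# One hand-written candidate: 2 * jaccard(name) + exact(email).
candidate = ("+",
             ("*", ("const", 2.0), ("ev", "name", "jaccard")),
             ("ev", "email", "exact"))

# Made-up records for illustration.
rec_1 = {"name": "Maria Silva Santos", "email": "msantos@example.org"}
rec_2 = {"name": "Maria Santos", "email": "msantos@example.org"}
rec_3 = {"name": "Joao Pereira", "email": "jp@example.org"}

print(is_replica(candidate, rec_1, rec_2, threshold=1.5))
print(is_replica(candidate, rec_1, rec_3, threshold=1.5))
```

In this encoding, swapping `"jaccard"` for `"ratio"` in an evidence leaf changes which similarity function is associated with a field. That is the kind of association the abstract says is now chosen by GP instead of being fixed in advance.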