Learning Expressive Linkage Rules for Entity Matching using Genetic Programming

Created by W.Langdon from gp-bibliography.bib Revision:1.4496

  author =       "Robert Isele",
  title =        "Learning Expressive Linkage Rules for Entity Matching
                 using Genetic Programming",
  school =       "Mannheim",
  year =         "2013",
  address =      "Germany",
  month =        "10 " # jul,
  keywords =     "genetic algorithms, genetic programming, Entity
                 Matching, Record Linkage, Data Integration, Linkage
                 Rules, Active Learning",
  URL =          "https://ub-madoc.bib.uni-mannheim.de/33418/",
  URL =          "https://ub-madoc.bib.uni-mannheim.de/33418/1/Isele_Dissertation.pdf",
  size =         "224 pages",
  abstract =     "A central problem in data integration and data
                 cleansing is to identify pairs of entities in data sets
                 that describe the same real-world object. Many existing
                 methods for matching entities rely on explicit linkage
                 rules, which specify how two entities are compared for
                 equivalence. Unfortunately, writing accurate linkage
                 rules by hand is a non-trivial problem that requires
                 detailed knowledge of the involved data sets. Another
                 important issue is the efficient execution of linkage
                 rules. In this thesis, we propose a set of novel
                 methods that cover the complete entity matching
                 workflow from the generation of linkage rules using
                 genetic programming algorithms to their efficient
                 execution on distributed systems. First, we propose a
                 supervised learning algorithm that is capable of
                 generating linkage rules from a gold standard
                 consisting of a set of entity pairs that have been
                 labelled as duplicates or non-duplicates. We show that
                 the introduced algorithm outperforms previously
                 proposed entity matching approaches including the
                 state-of-the-art genetic programming approach by de
                 Carvalho et al. and is capable of learning linkage
                 rules that achieve an accuracy similar to that of the
                 human-written rule for the same problem. In order to also
                 cover use cases for which no gold standard is
                 available, we propose a complementary active learning
                 algorithm that generates a gold standard interactively
                 by asking the user to confirm or decline the
                 equivalence of a small number of entity pairs. In the
                 experimental evaluation, labelling at most 50 link
                 candidates was necessary in order to match the
                 performance that is achieved by the supervised GenLink
                 algorithm on the entire gold standard. Finally, we
                 propose an efficient execution workflow that can be
                 run on a cluster of multiple machines. The execution
                 workflow employs a novel multidimensional indexing
                 method that allows the efficient execution of learnt
                 linkage rules by reducing the number of required
                 comparisons significantly.",
  notes =        "Supervised by Professor Bizer and Professor

                 Woody Allen in DBpedia and Freebase",
