Evolutionary Approaches to Data Integration Related Problems

Created by W.Langdon from gp-bibliography.bib Revision:1.4524

  author =       "Moises Gomes {de Carvalho}",
  title =        "Evolutionary Approaches to Data Integration Related
                 Problems",
  school =       "Computer Science of the Federal University of Minas
                 Gerais",
  year =         "2009",
  address =      "Belo Horizonte, Brazil",
  month =        "26 " # oct,
  keywords =     "genetic algorithms, genetic programming, Data
                 Integration, Record Deduplication, Schema Matching",
  keywords_pt =  "Programacao genetica, Integracao de dados,
                 Deduplicacao de registros",
  URL =          "http://www.dcc.ufmg.br/pos/cursos/defesas/901D.PDF",
  size =         "138 pages",
  abstract =     "Data integration aims to combine data from different
                 sources (data repositories such as databases, digital
                 libraries, etc.) by adopting a global data model and by
                 detecting and resolving schema and data conflicts so
                 that a homogeneous, unified view can be provided. Two
                 specific problems related to data integration - schema
                 matching and replica identification - present a large
                 solution space. Exploring this space intensively and
                 exhaustively with traditional approaches is
                 computationally expensive and technically prohibitive.
                 Moreover, solutions to these problems usually require
                 that multiple, sometimes conflicting, objectives be
                 satisfied simultaneously. This thesis aims to show that
                 evolutionary-based techniques can be successfully
                 applied to such problems, leading to novel approaches
                 and methods that address all the aforementioned
                 requirements and, at the same time, provide efficient
                 and highly accurate solutions.

                 In this thesis, we first propose a genetic programming
                 approach to record deduplication. This approach
                 combines several different pieces of evidence extracted
                 from the actual data present in the repositories to
                 suggest a deduplication function that is able to
                 identify whether two entries in a repository are
                 replicas. As shown by our experiments, our approach
                 outperforms existing state-of-the-art methods found in
                 the literature. Moreover, the suggested function is
                 computationally less demanding since it uses less
                 evidence. Finally, it is also important to note that
                 our approach is capable of automatically adapting to a
                 given fixed replica identification boundary, freeing
                 the user from the burden of having to choose and tune
                 this parameter.

                 Based on the previous approach, we also devised a novel
                 evolutionary approach that is able to automatically
                 find complex schema matches. Our aim was to develop a
                 method to find semantic relationships between schema
                 elements, in a restricted scenario in which only the
                 data instances are available. To the best of our
                 knowledge, this is the first approach that is capable
                 of discovering complex schema matches using only the
                 data instances, which is performed by exploiting record
                 deduplication and information retrieval techniques to
                 find schema matches during the evolutionary process. To
                 demonstrate the effectiveness of our approach, we
                 conducted an experimental evaluation using real-world
                 and synthetic datasets. Our results show that our
                 approach is able to find complex matches with high
                 accuracy, despite using only the data instances.",
  notes =        "supervisor: Alberto Henrique Frade Laender",
