An unsupervised heuristic-based approach for bibliographic metadata deduplication

Created by W.Langdon from gp-bibliography.bib Revision:1.4192

  author =       "Eduardo N. Borges and Moises G. {de Carvalho} and 
                 Renata Galante and Marcos Andre Goncalves and 
                 Alberto H. F. Laender",
  title =        "An unsupervised heuristic-based approach for
                 bibliographic metadata deduplication",
  journal =      "Information Processing \& Management",
  volume =       "47",
  number =       "5",
  pages =        "706--718",
  year =         "2011",
  note =         "Managing and Mining Multilingual Documents",
  ISSN =         "0306-4573",
  DOI =          "10.1016/j.ipm.2011.01.009",
  URL =          "",
  keywords =     "genetic algorithms, genetic programming, Digital
                 libraries, Metadata, Deduplication, Similarity",
  abstract =     "Digital libraries of scientific articles contain
                 collections of digital objects that are usually
                 described by bibliographic metadata records. These
                 records can be acquired from different sources and be
                 represented using several metadata standards. These
                 metadata standards may be heterogeneous in both
                 content and structure. All of this implies that many
                 records may be duplicated in the repository, thus
                 affecting the quality of services, such as searching
                 and browsing. In this article we present an approach
                 that identifies duplicated bibliographic metadata
                 records in an efficient and effective way. We propose
                 similarity functions especially designed for the
                 digital library domain and experimentally evaluate
                 them. Our results show that the proposed functions
                 improve the quality of metadata de-duplication up to
                 188\% compared to four different baselines. We
                 also show that our approach achieves statistically
                 equivalent results when compared to a state-of-the-art
                 method for replica identification based on genetic
                 programming, without the burden and cost of any
                 training process.",

Genetic Programming entries for Eduardo N Borges, Moises G de Carvalho, Renata Galante, Marcos Andre Goncalves, Alberto H F Laender