An Improving Genetic Programming Approach Based Deduplication Using KFINDMR

Created by W.Langdon from gp-bibliography.bib Revision:1.4221

  title =        "An Improving Genetic Programming Approach Based
                 Deduplication Using {KFINDMR}",
  author =       "P. Shanmugavadivu and N. Baskar",
  journal =      "International Journal of Computer Trends and
                 Technology",
  year =         "2012",
  volume =       "3",
  number =       "5",
  pages =        "694--701",
  month =        sep # "-" # oct,
  keywords =     "genetic algorithms, genetic programming, extracting
                 data, identifying duplication, deduplication",
  publisher =    "Seventh Sense Research Group",
  ISSN =         "2231-2803",
  bibsource =    "OAI-PMH server at",
  oai =          "oai:doaj-articles:888f8d9f98d711833425c4b976780e4e",
  URL =          "",
  size =         "8 pages",
  abstract =     "Record deduplication is the task of identifying, in a
                 data repository, records that refer to the same
                 real-world entity or object in spite of misspelled
                 words, typos, different writing styles or even
                 different schema representations or data types. The
                 existing system provides an Unsupervised Duplication
                 Detection (UDD) method which can be used to identify
                 and remove the duplicate records from different data
                 sources. Starting from the non-duplicate
                 set, two cooperating classifiers, a Weighted Component
                 Similarity Summing (WCSS) classifier and a Support
                 Vector Machine (SVM), are used to iteratively identify
                 the duplicate records among the non-duplicate records.
                 A genetic programming (GP) approach to record
                 deduplication is also presented. This GP-based
                 approach is
                 also able to automatically find effective deduplication
                 functions. Because the genetic programming approach is
                 time consuming, we propose a new algorithm, KFINDMR
                 (KFIND using Most Represented data samples), which
                 finds the most represented data samples to improve the
                 accuracy of the classifier. The proposed system
                 calculates the mean value of the most represented data
                 samples in the centroid of the record members; it then
                 selects the first most represented data sample, the
                 one closest to the mean value, by calculating the
                 minimum distance. The system removes the duplicate
                 data samples and finds an optimised solution to the
                 deduplication of records or data samples.",
