Improved Rule-based Document Representation and Classification using Genetic Programming

Created by W.Langdon from gp-bibliography.bib Revision:1.4420

  author =       "Yasaman Soltan-Zadeh",
  title =        "Improved Rule-based Document Representation and
                 Classification using Genetic Programming",
  school =       "Royal Holloway, University of London",
  year =         "2011",
  address =      "Egham, Surrey TW20 0EX, UK",
  month =        oct,
  keywords =     "genetic algorithms, genetic programming, information
                 retrieval, machine learning, classification",
  URL =          "",
  size =         "186 pages",
  abstract =     "In the field of information retrieval and in
                 particular classification, the mathematical and
                 statistical rules and classifiers are not human
                 readable. Non-human readable rules and classifiers act
                 as a barrier in using expert knowledge to improve

                 Such barriers can be overcome using genetic
                 programming. The aim of this thesis is to produce
                 classifiers and in particular document representatives
                 which are human readable using genetic programming.
                 Human readability makes these representatives more
                 interactive and adaptable by providing the possibility
                 of integrating expert knowledge.

                 Genetic programming, as a non-deterministic method with
                 high flexibility, is among the best options to produce
                 human readable document representatives. To test the
                 results of the chosen method, standard test collections
                 are used. These standard test collections guarantee
                 that the experiments are replicable and the results are
                 reproducible by other researchers.

                 This thesis demonstrates the process of producing human
                 readable document representatives with transparency for
                 further modification and analysis by expert knowledge,
                 while retaining the performance.

                 To obtain these findings, this thesis has contributed
                 to the field by developing a system that introduces a
                 novel tree structure to improve the feature selection
                 process, and a novel fitness function to improve the
                 quality of the representative generator.

                 To produce a human readable representative, the tree
                 structure is changed into a new shape with more control
                 over the number of children. This reduces the depth of
                 each tree for a given number of features and results in
                 a flatter structure. A fitness function is constructed
                 by combining classification accuracy on the training
                 and validation sets and a parsimony component. This
                 study found that the order in which documents are
                 matched with representatives can improve overall
                 performance.
                 Different feature selection methods are investigated
                 and integrated into our genetic programming based
                 feature selection method, which is based on a
                 probability distribution derived from the feature
                 weights.",
  notes =        "Supervisors: Masoud Saeedi and Ashok Jashapara",
