Application of genetic programming to text categorization

Created by W.Langdon from gp-bibliography.bib Revision:1.4221

  author =       "Wojciech M. Chrosny",
  title =        "Application of genetic programming to text
                 categorization",
  school =       "Computer Science, Polytechnic University",
  year =         "2000",
  month =        jan,
  keywords =     "genetic algorithms, genetic programming",
  URL =          "",
  size =         "155 pages",
  abstract =     "This dissertation uses genetic programming in text
                 categorization problems. Genetic programming algorithms
                 are applied to a set of news articles to evolve
                 programs that determine whether the article belongs to
                 a particular category. The programs are randomly
                 generated from the set of initial functions and
                 constants. Programs with the fewest false
                 assignments are favoured in the selection for
                 recombination in the subsequent iterations of the
                 genetic programming algorithm. The form of the solution
                 is not determined a priori as in other text
                 categorization methods. The basis set of functions and
                 constants used by the genetic analysis program is
                 specified in advance and may include the three basic
                 logical functions and a set of vocabulary words. Other
                 sets of basis functions can be supplied to the genetic
                 algorithm to obtain different programs. The form in
                 which these functions and constants are combined is
                 determined randomly by the genetic algorithm. The
                 results indicate that genetic programming methods are,
                 in the cases examined, as good as or slightly better
                 than other decision tree or rule induction methods
                 described by Apte et al. [Apte 1994]. The Genetic Programming
                 methods used a simpler set of features and functions:
                 no word stemming, no explicit stop word removal, a local
                 dictionary, and Boolean functions. The F1-measure of
                 categorization performance of 80 percent achieved by
                 Genetic Programming compares favorably with the 78.5 percent
                 break-even performance of traditional Boolean rule
                 induction methods. It is comparable with the 80.5 percent
                 break-even performance of the rule induction methods
                 with a more complex feature set such as word frequency
                 [Apte 1994]. Characteristics of Genetic Programming
                 text categorization were studied to understand the
                 sensitivity of Genetic Programming methods to
                 vocabulary size, population size, training and testing
                 set selection methods. Temporal characteristics of the
                 Reuters Article Corpus [Lewis-21578] were studied. The
                 results are of interest to both Genetic Programming and
                 traditional categorization methods and may
                 point to significant future performance improvements in
                 both domains. In some cases these results were better
                 than Apte's.",
  notes =        "Supervisor: Robert J. Flynn

                 Wojciech Marek Chrosny UMI Microform 9949161",

Genetic Programming entries for Wojciech M Chrosny