Estimating the Credibility of Examples in Automatic Document Classification

Created by W.Langdon from gp-bibliography.bib Revision:1.4549

  author =       "Joao R. M. Palotti and Thiago Salles and 
                 Gisele L. Pappa and Filipe {de Lima Arcanjo} and 
                 Marcos Andre Goncalves and Wagner {Meira. Jr.}",
  title =        "Estimating the Credibility of Examples in Automatic
                 Document Classification",
  journal =      "Journal of Information and Data Management",
  year =         "2010",
  volume =       "1",
  number =       "3",
  pages =        "439--454",
  month =        oct,
  keywords =     "genetic algorithms, genetic programming, credibility,
                 automatic document classification",
  URL =          "",
  biburl =       "",
  annote =       "The Pennsylvania State University CiteSeerX Archives",
  language =     "en",
  oai =          "oai:CiteSeerX.psu:",
  size =         "16 pages",
  abstract =     "Classification algorithms usually assume that any
                 example in the training set should contribute equally to
                 the classification model being generated. However, this
                 is not always the case. This paper shows that the
                 contribution of an example to the classification model
                 varies according to many factors, which are application
                 dependent, and can be estimated using what we call a
                 credibility function. The credibility of an entity
                 reflects how much value it aggregates to a task being
                 performed, and here we investigate it in Automatic
                 Document Classification, where the credibility of a
                 document relates to its terms, authors, citations,
                 venues, time of publication, among others. After
                 introducing the concept of credibility in
                 classification, we investigate how to estimate a
                 credibility function using information regarding
                 documents' content, citations and authorship using
                 mainly metrics previously defined in the literature. As
                 the credibility of the content of a document can be
                 easily mapped to any other classification problem, in a
                 second phase we focus on content-based credibility
                 functions. We propose a genetic programming algorithm
                 to estimate this function based on a large set of
                 metrics generally used to measure the strength of
                 term-class relationship. The proposed and evolved
                 credibility functions are then incorporated into the
                 Naive Bayes classifier, and applied to four text
                 collections, namely ACM-DL, Reuters, Ohsumed, and 20
                 Newsgroup. The results obtained showed significant
                 improvements in both micro-F1 and macro-F1, with gains
                 up to 21 percent in Ohsumed when compared to the
                 traditional Naive Bayes.",
  notes =        "SBBD 2010",

Genetic Programming entries for Joao Palotti Thiago Cunha de Moura Salles Gisele L Pappa Filipe de Lima Arcanjo Marcos Andre Goncalves Wagner Meira