Reducing overfitting in genetic programming models for software quality classification

Created by W.Langdon from gp-bibliography.bib Revision:1.4420

  author =       "Yi Liu and Taghi Khoshgoftaar",
  title =        "Reducing overfitting in genetic programming models for
                 software quality classification",
  booktitle =    "Proceedings of the Eighth IEEE Symposium on
                 International High Assurance Systems Engineering",
  year =         "2004",
  month =        "25-26 " # mar,
  pages =        "56--65",
  address =      "Tampa, Florida, USA",
  keywords =     "genetic algorithms, genetic programming",
  ISSN =         "1530-2059",
  DOI =          "doi:10.1109/HASE.2004.1281730",
  size =         "10 pages",
  abstract =     "A high-assurance system is largely dependent on the
                 quality of its underlying software. Software quality
                 models can provide timely estimations of software
                 quality, allowing the detection and correction of
                 faults prior to operations. A software metrics-based
                 quality prediction model may depict overfitting, which
                 occurs when a prediction model has good accuracy on the
                 training data but relatively poor accuracy on the test
                 data. In this paper, we present an approach to address
                 the overfitting problem in the context of software
                 quality classification models based on genetic
                 programming (GP). The overfitting problem has not been
                 addressed in depth for GP-based models. The general aim
                 of classifying software modules as fault-prone (fp) and
                 not fault-prone (nfp) is to aid software management in
                 expending its limited resources toward improving only
                 the fp modules. The presence of overfitting in such a
                 software quality model affects its practical
                 usefulness, because management is interested in good
                 performance of the model when applied to unseen data,
                 i.e., generalization performance. In the process of
                 building GP-based software quality classification
                 models for a high-assurance telecommunications system,
                 we observed that the GP models were prone to
                 overfitting. We use a random sampling technique to
                 reduce overfitting in our GP models. The approach has
                 been found by many researchers to be an effective method
                 for reducing the time of a GP run. However, in our
                 study we use random sampling to reduce overfitting with
                 the aim of improving the generalization capability of
                 our GP models. A case study of an industrial
                 high-assurance software system is used to demonstrate
                 the effectiveness of the random sampling technique.",
  notes =        "HASE 2004",
