Symbolic Regression for Knowledge Discovery - Bloat, Overfitting, and Variable Interaction Networks

Created by W.Langdon from gp-bibliography.bib Revision:1.4333

  author =       "Gabriel K. Kronberger",
  title =        "Symbolic Regression for Knowledge Discovery - Bloat,
                 Overfitting, and Variable Interaction Networks",
  school =       "Johannes Kepler University",
  year =         "2010",
  address =      "Linz, Austria",
  keywords =     "genetic algorithms, genetic programming",
  URL =          "",
  size =         "211 pages",
  abstract =     "With the growing amount of data that are collected and
                 recorded in various application areas, the need to use
                 these data is also growing. In science, data have
                 always played an important role; in recent years,
                 however, the economic potential of data has also become
                 increasingly important. In combination with methods for
                 data analysis, data can be used to their full
                 potential, whether in the commercial sector to optimise
                 offers, or in the industrial sector to optimise
                 resources and product quality based on process data.
                 This work describes a new approach for the analysis of
                 data which is based on symbolic regression with genetic
                 programming and aims to generate an overall view of the
                 interactions of various variables of a system. By this
                 means, all potentially interesting relationships that
                 can be detected in a dataset should be identified and
                 represented as compact and understandable models. In
                 the first part of this work, this approach of
                 comprehensive symbolic regression is described in
                 detail. Important issues that play a role in the
                 process are the prevention of bloat and over-fitting,
                 the simplification of models, and the identification of
                 relevant input variables. In this context, different
                 methods for bloat control and prevention are presented
                 and compared. In particular, the influence of offspring
                 selection on bloat is analysed. In addition, a new way
                 to detect over-fitting is presented. On the basis of
                 this, extensions for the reduction of over-fitting are
                 presented and compared. Pruning of models is featured
                 prominently, on the one hand to prevent over-fitting
                 and on the other hand to simplify complex models. An
                 important aspect is the analysis of the vast amount of
                 different models that results from the proposed
                 approach. In this context, different methods to
                 quantify relevant factors are proposed. These methods
                 can be used to identify interactions of variables of
                 the analysed system. Visualising such interactions
                 provides a general overview of the system in question,
                 which would not be possible through analysis of
                 individual models that concentrate on selected aspects of
                 the problem. Additionally, the prognosis of
                 multivariate time series with genetic programming is
                 described in the first part. The second part of this
                 work shows how the described approach can be applied to
                 the analysis of real-world systems, and how the results
                 of this data analysis process can yield new knowledge
                 about the investigated system. The
                 analysed data stem from a blast furnace for the
                 production of steel and an industrial chemical process.
                 In addition, the same approach is also applied to a
                 collection of economic data in order to identify
                 macro-economic interactions.",

Genetic Programming entries for Gabriel Kronberger