Model-based Problem Solving through Symbolic Regression via Pareto Genetic Programming

Created by W.Langdon from gp-bibliography.bib Revision:1.4524

  author =       "Ekaterina Vladislavleva",
  title =        "Model-based Problem Solving through Symbolic
                 Regression via Pareto Genetic Programming",
  school =       "Tilburg University",
  year =         "2008",
  address =      "Tilburg, the Netherlands",
  month =        aug,
  isbn13 =       "978 90 5668 217 0",
  keywords =     "genetic algorithms, genetic programming, symbolic
  URL =          "",
  broken =       "",
  size =         "288 pages",
  abstract =     "The main focus of this dissertation is identification
                 of relationships from given input-output data by means
                 of symbolic regression. The challenging task of
                 symbolic regression is to identify and express a real
                 or simulated system or a process, based on a limited
                 number of observations of the system's behaviour.

                 The system under study is characterised by some
                 important control parameters which need to be available
                 to an observer, but are usually difficult to monitor,
                 e.g. they need to be measured in a lab, simulated or
                 observed in real time only, or at high time and
                 computational expense. Empirical modelling attempts to
                 express these critical control variables via other
                 controllable variables that are easier to monitor, can
                 be measured more accurately or timely, are cheaper to
                 simulate, etc. Symbolic regression provides such
                 expressions of crucial process characteristics, or,
                 response variables, defined (symbolically) as
                 mathematical functions of some of the easy-to-measure
                 input variables, and calls these expressions empirical
                 input-response models. Examples of these are (i)
                 structure-activity relationships in pharmaceutical
                 research, which define the activity of a drug through
                 the physical structure of molecules of drug components,
                 (ii) structure-property relationships in material
                 science, which define product qualities, such as
                 shininess, opacity, smell, or stiffness through
                 physical properties of composites and processing
                 conditions, or (iii) economic models, e.g. expressing
                 return on investment through daily closes of S&P 500
                 quotes and inflation rates.",
  abstract =     "Industrial modelling problems that are tractable for
                 symbolic regression have two main characteristics: (1)
                 No or little information is known about the underlying
                 system producing the data, and therefore no assumptions
                 on model structure can be made; (2) The available data
                 is high-dimensional, and often not balanced, with
                 either an abundant or an insufficient number of samples.

                 To discover plausible models with realistic time and
                 computational efforts, symbolic regression exploits a
                 stochastic iterative search technique, based on
                 artificial evolution of model expressions. This method,
                 called genetic programming, looks for appropriate
                 expressions of the response variable in the space of
                 all valid formulae containing a minimal set of input
                 variables and a proposed set of basic operators and

                 At each step, the genetic programming system considers
                 a sufficiently large quantity of various formulae,
                 selects the subset of the best formulae according to
                 certain user-defined criteria of goodness, and
                 (re)combines the best formulae to create a rich set of
                 potential solutions for the next step. This approach is
                 inspired by principles of natural selection, where the
                 offspring that inherits good features from both parents
                 increases its chances of success in survival,
                 adaptation, and further propagation. The challenge and
                 the rationale of performing evolutionary search is to
                 balance the exploitation of the good solutions
                 discovered so far, with exploration of the new areas of
                 the search space, where even better solutions may be
  abstract =     "The fact that symbolic regression via genetic
                 programming (GP) does not impose any assumptions on the
                 structure of the input-output models means that the
                 model structure is to a large extent determined by data
                 and also by selection objectives used in the
                 evolutionary search. On the one hand, this is an advantage
                 and a unique capability compared with other global
                 approximation techniques, since it potentially allows
                 the development of inherently simpler models than, for
                 example, interpolation with polynomials or spatial
                 correlation analysis. On the other hand, the absence of
                 constraints on model structure is the greatest
                 challenge for symbolic regression since it vastly
                 increases the search space of possible solutions, which
                 is already inherently large.

                 A special multi-objective flavour of a genetic
                 programming search is considered, called Pareto GP.
                 Pareto GP used for symbolic regression has strong
                 advantages in creating diverse sets of regression
                 models, satisfying competing criteria of model
                 structural simplicity and model prediction accuracy.",
  abstract =     "This thesis extends the Pareto genetic programming
                 methodology by additional generic model selection and
                 generation strategies that (1) drive the modelling
                 engine towards creating models of reduced non-linearity
                 and increased generalisation capabilities, and (2)
                 improve the effectiveness of the search for robust
                 models by goal softening, adaptive fitness evaluations,
                 and enhanced training strategies.

                 In addition to the new strategies for model development
                 and model selection, this dissertation presents a new
                 approach for analysis, ranking, and compression of
                 given multi-dimensional input-output data for the
                 purpose of balancing the information content in
                 undesigned data sets.

                 To present contributions of this research in the
                 context of real-life problem solving, the dissertation
                 exploits a generic framework of adaptive model-based
                 problem solving used in many industrial modelling
                 applications. This framework consists of an iterative
                 feed-back loop over: (Part I) data generation, analysis
                 and adaptation, (Part II) model development, and (Part
                 III) problem analysis and reduction.",
  abstract =     "Part I of the thesis consists of Chapter 2 and is
                 devoted to data analysis. It studies the ways to
                 balance multi-dimensional input-output data for making
                 further modelling more successful. Chapter 2 proposes
                 several novel methods for interpretation and
                 manipulation of given high-dimensional input-output
                 data, such as relative weighting of the data, ranking the
                 data records in order of increasing importance, and
                 assessing the compressibility and information content
                 of a multi-dimensional data set. All methods exploit
                 the geometrical structure of the data and relative
                 distances to nearest-in-the-input-space neighbours. All
                 methods treat response values differently, assuming
                 that the data belongs to a response surface, which
                 needs to be identified.

                 Part II of the thesis consists of Chapters 3-7 and
                 addresses the model induction method - Pareto genetic
                 programming. Since time to solution, or, more
                 accurately, time-to-convincing-solution, is a major
                 practical challenge of evolutionary search algorithms,
                 and Pareto GP in particular, Part II focuses on
                 algorithmic enhancements of Pareto GP that lead it to
                 the discovery of better solutions faster (i.e.
                 solutions of sufficient quality at a smaller
                 computational effort, or of considerably better quality
                 at the same computational effort).",
  abstract =     "

                 In Chapter 3 a general description of the Pareto GP
                 methodology is presented in a framework of evolutionary
                 search, as an iterative loop over the stages of model
                 generation, model evaluation, and model selection.

                 In Chapter 5 a novel strategy for model selection
                 through explicit non-linearity control is presented. A
                 new complexity measure called the order of
                 non-linearity of symbolic models is introduced and used
BibTeX entry too long. Truncated

Genetic Programming entries for Ekaterina (Katya) Vladislavleva