# Model-based Problem Solving through Symbolic Regression via Pareto Genetic Programming

Created by W.Langdon from gp-bibliography.bib Revision:1.4524

```
@PhdThesis{vladislavleva:2008:thesis,
author =       "Ekaterina Vladislavleva",
title =        "Model-based Problem Solving through Symbolic
Regression via Pareto Genetic Programming",
school =       "Tilburg University",
year =         "2008",
month =        aug,
isbn13 =       "978 90 5668 217 0",
keywords =     "genetic algorithms, genetic programming, symbolic
regression",
URL =          "http://arno.uvt.nl/show.cgi?fid=80764",
size =         "288 pages",
abstract =     "The main focus of this dissertation is identification
of relationships from given input-output data by means
of symbolic regression. The challenging task of
symbolic regression is to identify and express a real
or simulated system or a process, based on a limited
number of observations of the system's behaviour.

The system under study is characterised by some
important control parameters which need to be available
to an observer, but are usually difficult to monitor,
e.g. they need to be measured in a lab, simulated or
observed in real time only, or obtained at high time and
computational expense. Empirical modelling attempts to
express these critical control variables via other
controllable variables that are easier to monitor, can
be measured more accurately or timely, are cheaper to
simulate, etc. Symbolic regression provides such
expressions of crucial process characteristics, or,
response variables, defined (symbolically) as
mathematical functions of some of the easy-to-measure
input variables, and calls these expressions empirical
input-response models. Examples of these are (i)
structure-activity relationships in pharmaceutical
research, which define the activity of a drug through
the physical structure of molecules of drug components,
(ii) structure-property relationships in material
science, which define product qualities, such as
shininess, opacity, smell, or stiffness through
physical properties of composites and processing
conditions, or (iii) economic models, e.g. expressing
return on investment through daily closes of S&P 500
quotes and inflation rates.",
abstract =     "Industrial modelling problems that are tractable for
symbolic regression have two main characteristics: (1)
No or little information is known about the underlying
system producing the data, and therefore no assumptions
on model structure can be made; (2) The available data
is high-dimensional, and often not balanced, with
either an abundant or an insufficient number of samples.

To discover plausible models with realistic time and
computational efforts, symbolic regression exploits a
stochastic iterative search technique, based on
artificial evolution of model expressions. This method,
called genetic programming, looks for appropriate
expressions of the response variable in the space of
all valid formulae containing a minimal set of input
variables and a proposed set of basic operators and
constants.

At each step, the genetic programming system considers
a sufficiently large quantity of various formulae,
selects the subset of the best formulae according to
certain user-defined criteria of goodness, and
(re)combines the best formulae to create a rich set of
potential solutions for the next step. This approach is
inspired by principles of natural selection, where the
offspring that inherits good features from both parents
increases the chances to be successful in survival,
adaptation, and further propagation. The challenge and
the rationale of performing evolutionary search is to
balance the exploitation of the good solutions
discovered so far, with exploration of the new areas of
the search space, where even better solutions may be
found.",
abstract =     "The fact that symbolic regression via genetic
programming (GP) does not impose any assumptions on the
structure of the input-output models means that the
model structure is to a large extent determined by data
and also by selection objectives used in the
evolutionary search. On the one hand, this is an advantage
and a unique capability compared with other global
approximation techniques, since it potentially allows
developing inherently simpler models than those obtained,
for example, by interpolation with polynomials or spatial
correlation analysis. On the other hand, the absence of
constraints on model structure is the greatest
challenge for symbolic regression since it vastly
increases the search space of possible solutions which

A special multi-objective flavour of a genetic
programming search is considered, called Pareto GP.
Pareto GP used for symbolic regression has strong
advantages in creating diverse sets of regression
models, satisfying competing criteria of model
structural simplicity and model prediction accuracy.",
abstract =     "This thesis extends the Pareto genetic programming
methodology by additional generic model selection and
generation strategies that (1) drive the modelling
engine towards creating models of reduced non-linearity
and increased generalisation capabilities, and (2)
improve the effectiveness of the search for robust
models by goal softening, adaptive fitness evaluations,
and enhanced training strategies.

In addition to the new strategies for model development
and model selection, this dissertation presents a new
approach for analysis, ranking, and compression of
given multi-dimensional input-output data for the
purpose of balancing the information content in
undesigned data sets.

To present contributions of this research in the
context of real-life problem solving, the dissertation
exploits a generic framework of adaptive model-based
problem solving used in many industrial modelling
applications. This framework consists of an iterative
feed-back loop over: (Part I) data generation, analysis
and adaptation, (Part II) model development, and (Part
III) problem analysis and reduction.",
abstract =     "Part I of the thesis consists of Chapter 2 and is
devoted to data analysis. It studies the ways to
balance multi-dimensional input-output data for making
further modelling more successful. Chapter 2 proposes
several novel methods for interpretation and
manipulation of given high-dimensional input-output
data such as relative weighting of the data, ranking the
data records in the order of increasing importance, and
assessing the compressibility and information content
of a multi-dimensional data set. All methods exploit
the geometrical structure of the data and relative
distances to nearest-in-the-input-space neighbours. All
methods treat response values differently, assuming
that the data belongs to a response surface, which
needs to be identified.

Part II of the thesis consists of Chapters 3-7 and
addresses the model induction method - Pareto genetic
programming. Since time to solution, or, more
accurately, time-to-convincing-solution is a major
practical challenge of evolutionary search algorithms,
and Pareto GP in particular, Part II focuses on
algorithmic enhancements of Pareto GP that lead it to
the discovery of better solutions faster (i.e.
solutions of sufficient quality at a smaller
computational effort, or of considerably better quality
at the same computational effort).",
abstract =     "

In Chapter 3 a general description of the Pareto GP
methodology is presented in a framework of evolutionary
search, as an iterative loop over the stages of model
generation, model evaluation, and model selection.

In Chapter 5 a novel strategy for model selection
through explicit non-linearity control is presented. A
new complexity measure called the order of
non-linearity of symbolic models is introduced and used
BibTeX entry too long. Truncated

```
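
The abstract above describes Pareto GP as selecting models under competing criteria of structural simplicity and prediction accuracy. As a rough illustration only, the following Python sketch filters a set of candidate models down to those that are not dominated in both criteria. The model strings, the complexity and error numbers, and the function name `pareto_front` are all hypothetical and are not taken from the thesis.

```python
from typing import List, Tuple

# (expression, complexity, error): hypothetical candidate models.
Model = Tuple[str, int, float]


def pareto_front(models: List[Model]) -> List[Model]:
    """Return the models not dominated in (complexity, error).

    Lower is better for both objectives. A model is dominated when another
    model is at least as good in both objectives and strictly better in one.
    """
    front = []
    for name, c, e in models:
        dominated = any(
            (c2 <= c and e2 <= e) and (c2 < c or e2 < e)
            for _, c2, e2 in models
        )
        if not dominated:
            front.append((name, c, e))
    # Sort by complexity so the trade-off curve reads left to right.
    return sorted(front, key=lambda m: m[1])


# Illustrative numbers only; they do not come from any experiment.
candidates = [
    ("x1", 1, 0.90),
    ("x1 + x2", 3, 0.40),
    ("x1 * x2", 3, 0.55),
    ("x1 + x2 * x2", 5, 0.40),
    ("sin(x1) + x2 / x1", 6, 0.10),
]
front = pareto_front(candidates)
for name, complexity, error in front:
    print(f"{name}: complexity={complexity}, error={error}")
```

In this sketch, a model stays on the front only if no other candidate is both simpler and more accurate, so the result is a set of trade-offs rather than a single best formula. The thesis's own complexity measures and selection strategies are not reproduced here.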

Genetic Programming entries for Ekaterina (Katya) Vladislavleva
