On the Importance of Data Balancing for Symbolic Regression

Created by W.Langdon from gp-bibliography.bib Revision:1.3973

@Article{Vladislavleva:2010:ieeeTEC,
  author =       "Ekaterina Vladislavleva and Guido Smits and 
                 Dick {den Hertog}",
  title =        "On the Importance of Data Balancing for Symbolic
                 Regression",
  journal =      "IEEE Transactions on Evolutionary Computation",
  year =         "2010",
  volume =       "14",
  number =       "2",
  pages =        "252--277",
  month =        apr,
  keywords =     "genetic algorithms, genetic programming, Compression,
                 data balancing, data scoring, data weighting, fitting,
                 information content, modeling, subset selection,
                 symbolic regression",
  ISSN =         "1089-778X",
  DOI =          "10.1109/TEVC.2009.2029697",
  size =         "26 pages",
  abstract =     "Symbolic regression of input-output data
                 conventionally treats data records equally. We suggest
                 a framework for automatic assignment of weights to data
                 samples, which takes into account the sample's relative
                 importance. In this paper, we study the possibilities
                 of improving symbolic regression on real-life data by
                 incorporating weights into the fitness function. We
                 introduce four weighting schemes defining the
                 importance of a point relative to proximity,
                 surrounding, remoteness, and nonlinear deviation from k
                 nearest-in-the-input-space neighbors. For enhanced
                 analysis and modeling of large imbalanced data sets we
                 introduce a simple multidimensional iterative technique
                 for subsampling. This technique allows a sensible
                 partitioning (and compression) of data to nested
                 subsets of an arbitrary size in such a way that the
                 subsets are balanced with respect to either of the
                 presented weighting schemes. For cases where a given
                 input-output data set contains some redundancy, we
                 suggest an approach to considerably improve the
                 effectiveness of regression by applying more modeling
                 effort to a smaller subset of the data set that has a
                 similar information content. Such improvement is
                 achieved due to better exploration of the search space
                 of potential solutions at the same number of function
                 evaluations. We compare different approaches to
                 regression on five benchmark problems with a fixed
                 budget allocation. We demonstrate that the significant
                 improvement in the quality of the regression models can
                 be obtained either with the weighted regression,
                 exploratory regression using a compressed subset with a
                 similar information content, or exploratory weighted
                 regression on the compressed subset, which is weighted
                 with one of the proposed weighting schemes.",
  notes =        "also known as \cite{5325864}",
}
