Reinforcement learning with self-modifying policies

Created by W.Langdon from gp-bibliography.bib Revision:1.4420

  author =       "Juergen Schmidhuber and Jieyu Zhao and 
                 Nicol N. Schraudolph",
  title =        "Reinforcement learning with self-modifying policies",
  booktitle =    "Learning to learn",
  publisher =    "Kluwer",
  year =         "1997",
  editor =       "S. Thrun and L. Pratt",
  pages =        "293--309",
  keywords =     "genetic algorithms, genetic programming",
  abstract =     "A learner's modifiable components are called its
                 policy. An algorithm that modifies the policy is a
                 learning algorithm. If the learning algorithm has
                 modifiable components represented as part of the
                 policy, then we speak of a self-modifying policy (SMP).
                 SMPs can modify the way they modify themselves etc.
                 They are of interest in situations where the initial
                 learning algorithm itself can be improved by experience
                 -- this is what we call ``learning to learn''. How can
                 we force some (stochastic) SMP to trigger better and
                 better self-modifications? The success-story algorithm
                 (SSA) addresses this question in a lifelong
                 reinforcement learning context. During the learner's
                 life-time, SSA is occasionally called at times computed
                 according to SMP itself. SSA uses backtracking to undo
                 those SMP-generated SMP-modifications that have not
                 been empirically observed to trigger lifelong reward
                 accelerations (measured up until the current SSA call
                 -- this evaluates the long-term effects of
                 SMP-modifications setting the stage for later
                 SMP-modifications). SMP-modifications that survive SSA
                 represent a lifelong success history. Until the next
                 SSA call, they build the basis for additional
                 SMP-modifications. Solely by self-modifications our
                 SMP/SSA-based learners solve a complex task in a
                 partially observable environment (POE) whose state
                 space is far bigger than most reported in the POE
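
The backtracking step described in the abstract can be illustrated with a small Python sketch. The abstract does not give the formula, so the criterion used here is an assumption: reward per time measured since each surviving modification must strictly increase from the bottom of the stack to the top, starting from the lifelong average. The names Checkpoint, ssc_holds and ssa_call, and the dictionary-based policy, are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """Record of one policy modification (hypothetical structure)."""
    time: float       # time at which the modification was made
    reward: float     # cumulative reward collected up to that time
    key: str          # policy component that was changed
    old_value: float  # value to restore if the modification is undone


def ssc_holds(stack, now, total_reward):
    """Assumed success-story criterion: the reward rate since each
    surviving modification strictly exceeds the rate since the one
    below it, starting from the lifelong average total_reward / now.
    Assumes now > 0 and now > cp.time for every checkpoint."""
    prev_rate = total_reward / now
    for cp in stack:  # bottom of the stack first
        rate = (total_reward - cp.reward) / (now - cp.time)
        if rate <= prev_rate:
            return False
        prev_rate = rate
    return True


def ssa_call(policy, stack, now, total_reward):
    """Undo the most recent modifications, one at a time, until the
    criterion holds for everything left on the stack."""
    while stack and not ssc_holds(stack, now, total_reward):
        cp = stack.pop()
        policy[cp.key] = cp.old_value  # restore the earlier value
    return policy, stack
```

Modifications left on the stack after a call play the role of the "success history" in the abstract; an empty stack trivially satisfies the criterion, so the loop always terminates.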
