The University of Birmingham

Birmingham B15 2TT England

http://www.cs.bham.ac.uk/~mmk

**ABSTRACT**

Despite two decades of work, learning classifier systems researchers have had relatively little to say on the subject of what makes a problem difficult for a classifier system. Wilson's accuracy-based XCS, a promising and increasingly popular classifier system, is, we feel, the natural first choice of classifier system with which to address this issue. To make the task more tractable we limit our considerations to a restricted, but very important, class of problems. Most significantly, we consider only single step reinforcement learning problems and the use of the standard binary/ternary classifier systems language. In addition to distinguishing several dimensions of problem complexity for XCS, we consider their interactions, identify bounding cases of difficulty, and consider complexity metrics for XCS. Based on these results we suggest a simple template for ternary single step test suites to more comprehensively evaluate classifier systems.

Two basic questions to ask about any learning system are: to what kinds of problems is it well suited? To what kinds of problems is it poorly suited? Despite two decades of work, Learning Classifier Systems (LCS) researchers have had relatively little to say on the subject. Although this may in part be due to the wide range of systems and problems the LCS paradigm encompasses, it is certainly a reflection of deficiency in LCS theory.

Delving into the subject more deeply, a host of other questions arise. What is it about a problem that makes it difficult or easy for a given system? In other words, what factors are involved in determining a problem's difficulty? How do these factors interact? What effects do the representation(s) used by the system have? What effects does the rule discovery system have? In short, we want to know about the dimensions of problem complexity for the system of interest. Clearly this is a difficult subject even for one particular learning system. At present we consider only Wilson's XCS [Wilson95a] classifier system, although we hope much of the approach and some of the results will transfer to other LCS. To simplify matters, we restrict consideration to the standard ternary LCS language, to a binary action space and to single step reinforcement learning problems.

One reason to ask questions like those above is, of course, to find out when we can hope to successfully apply a given system to a given problem. Another reason is to better understand how to evaluate a system. Since we can't test our systems on all possible problems, we must consider only a subset. How can we choose this subset? There are at least two conflicting criteria. First, the subset should maximise the coverage of the dimensions of difficulty in order to more fully represent the space of all possible problems. That is, it should include as many as possible of the features which make a problem difficult for the system in question. Otherwise the test set may not detect deficiencies in an algorithm which would become apparent on other problems. Second, we would like to minimise the number of tests which must be made in order to make testing more manageable. Optimising these two criteria requires a good understanding of the dimensions of problem complexity for the system in question.

We address questions of problem complexity by considering the space of all possible test functions (i.e. learning problems) given our various restrictions. Based on our insights into this space we suggest a simple ternary single step test suite for LCS, and provide some results for XCS on it. To begin with, however, we consider how to approach the study of problem complexity in XCS.

In this section we briefly motivate our study, consider some representational issues, and outline our approach.

Single step functions are those in which the LCS's actions have no
influence on which inputs it receives in the future. This contrasts with
sequential (also called multi step) functions, in which actions
do influence future inputs. *(We'll refer to states and inputs
interchangeably.)*
In previous work we have mainly studied single step functions and we continue
that practice here for two reasons. First, some applications, e.g. data
mining, are single step, so it is of interest to understand and to optimise LCS
for single step problems. Second, single step functions avoid many
complications of sequential ones. Even if we're interested only in sequential
problems, it seems reasonable to first evaluate, understand, and possibly
improve our systems in the simpler single step case. When we have a good
understanding of the basic LCS mechanisms we can go on to look at issues which
are specific to the sequential case. This seems easier than starting with the
more complex case and having to face a host of additional problems at the
outset. We would argue that present understanding of LCS is limited enough to
justify this approach. (Clearly, however, useful work is being done with
sequential tests, not all of it addressing problems exclusive to them.)

Of course we have to be careful when evaluating potential improvements to ensure that we're not overfitting: optimising performance on single step problems at the expense of sequential ones. If we are interested in sequential problems we need sequential tests in our test suite.

In Reinforcement Learning (RL), feedback to the learner about the value of its actions consists only of numeric values, called rewards, and the goal of any reinforcement learner is to maximise (some function of) the rewards it receives. Rewards are defined a priori by the experimenter in the form of a reward function, which (for our purposes) maps input/action pairs to integers. The reward function is an essential component of any RL problem specification; changing the reward function changes the problem.

As an aside, XCS - like most other reinforcement learners, but unlike traditional LCS - learns an approximation of the entire reward function. That is, every input/action pair is represented. Traditional LCS, in contrast, are only concerned with representing the more rewarding parts of this space [Wilson95a,Kovacs2000a].

Working with LCS, we often refer to test problems in terms of an input/output mapping. One example is the 6 multiplexer, which has often been used as an LCS test. Figure 1 shows the 3 multiplexer, a related but simpler function. This exhaustive listing of input/output cases is called a truth table.

The 3 multiplexer is defined on binary strings of length 3, and treats the string as being composed of an index segment (the first bit) and a data segment (the remaining two bits). The value of the function is the value of the indexed bit in the data segment, so, for example, the value of 000 is 0, the value of 001 is 0, the value of 101 is 1 and so on. Knowing the value of the string, we can do Supervised Learning (SL); that is, we know which action the LCS must respond with in order to be correct.
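The mapping just described can be sketched in a few lines of Python (the function name and bit-list encoding are our own illustration):

```python
def multiplexer3(bits):
    """3 multiplexer: the first (index) bit selects which of the
    two data bits is the function's value."""
    index = bits[0]            # 0 selects bits[1], 1 selects bits[2]
    return bits[1 + index]

# Reproduce the examples from the text
print(multiplexer3([0, 0, 0]))  # value of 000 is 0
print(multiplexer3([0, 0, 1]))  # value of 001 is 0
print(multiplexer3([1, 0, 1]))  # value of 101 is 1
```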

However, the input/output mapping of figure 1 alone is not a complete specification of an RL problem, since it does not specify rewards. To adapt a Boolean function to the RL paradigm we must extend it with a reward function, which we have done in figure 2. Note that this figure refers to actions rather than the output of the function, because we are now dealing with a learning agent which acts (predicts the output of the function). Note also that we specify the reward for both the correct and incorrect action for each input, since we must specify a reward for each possible input/action pair. By "correct" we mean the action which receives the higher reward for that input.

Figure 1: The 3 multiplexer.

| Input | Output |
|-------|--------|
| 000   | 0      |
| 001   | 0      |
| 010   | 1      |
| 011   | 1      |
| 100   | 0      |
| 101   | 1      |
| 110   | 0      |
| 111   | 1      |

Figure 2: The 3 multiplexer with a reward function.

| Input | Action | Reward |
|-------|--------|--------|
| 000   | 0      | 1000   |
| 000   | 1      | 0      |
| 001   | 0      | 1000   |
| 001   | 1      | 0      |
| 010   | 0      | 0      |
| 010   | 1      | 1000   |
| 011   | 0      | 0      |
| 011   | 1      | 1000   |
| 100   | 0      | 1000   |
| 100   | 1      | 0      |
| 101   | 0      | 0      |
| 101   | 1      | 1000   |
| 110   | 0      | 1000   |
| 110   | 1      | 0      |
| 111   | 0      | 0      |
| 111   | 1      | 1000   |

We associated certain rewards with input/action pairs in figure 2 but clearly could have used other values. Other rewards will produce a 3 multiplexer problem as long as the correct action in each state is the output specified for that state in figure 1. If this is not the case, we no longer have a 3 multiplexer problem since it is the rewards which determine what input/output mapping will be learnt.

Even when rewards are consistent with the 3 multiplexer input/output mapping, 3 multiplexer RL problems differ when their reward functions differ. We'll see an example of this shortly.

Figure 3: A generalised representation of the problem in figure 2 using ternary conditions.

| Input | Action | Reward |
|-------|--------|--------|
| 00#   | 0      | 1000   |
| 00#   | 1      | 0      |
| 01#   | 0      | 0      |
| 01#   | 1      | 1000   |
| 1#0   | 0      | 1000   |
| 1#0   | 1      | 0      |
| 1#1   | 0      | 0      |
| 1#1   | 1      | 1000   |

Figure 4: The even 3 parity function.

| Input | Output |
|-------|--------|
| 000   | 1      |
| 001   | 0      |
| 010   | 0      |
| 011   | 1      |
| 100   | 0      |
| 101   | 1      |
| 110   | 1      |
| 111   | 0      |

In the original and standard representation used with XCS [Wilson95a], each
rule has a single fixed-length l-bit condition which is a string from
{0,1,#}^{l}, and single action which is a string from {0,1}^{a}. In this
work a=1. A condition c matches a binary input string i if the characters
in each position in c and i are the same, or if the character in c is a
#, so the # is the means by which conditions generalise over inputs. For
example, the condition 00# matches two inputs: 000 and 001. Actions do not
contain #s and so, using this representation, XCS cannot generalise over
actions.
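The matching rule just described is straightforward to state in code; a minimal sketch (the function name is ours):

```python
def matches(condition, input_str):
    """A ternary condition matches a binary input if, at every
    position, the characters are equal or the condition has '#'."""
    return all(c == i or c == '#' for c, i in zip(condition, input_str))

# The condition 00# matches exactly the inputs 000 and 001
print([s for s in ("000", "001", "010", "100") if matches("00#", s)])
# ['000', '001']
```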

The input/action/reward structure of the 3 multiplexer in figure 2 admits a number of accurate generalisations over inputs, and XCS seeks to find them. Figure 3 shows how XCS will learn to represent the problem defined in figure 2 using ternary classifier conditions to express generalisations over inputs. Using generalisation allows us (and XCS) to represent functions with fewer rules.

Notice that a table like figure 3 can be used in two ways: as a specification of the function to be learned by XCS, and by XCS to represent its hypotheses about the function it is learning.

The amount of generalisation possible depends on the input/output mapping and the representation used. Consider the even 3 parity function, whose output is 1 when there are an even number of 1s in the input, and 0 otherwise (figure 4). While the 3 multiplexer admits considerable generalisation using the ternary LCS language, a parity function admits none whatsoever. That is, any condition which contains 1 or more #s is overgeneral.
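For concreteness, the even parity output can be computed directly (a trivial sketch in Python):

```python
def even_parity(bits):
    """Even parity: 1 if the input contains an even number of 1s."""
    return 1 if sum(bits) % 2 == 0 else 0

print(even_parity([0, 0, 0]))  # 1: zero 1s is an even count
print(even_parity([1, 0, 1]))  # 1: two 1s
print(even_parity([1, 0, 0]))  # 0: one 1
```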

Consequently, to represent the even 3 parity function (regardless of what reward function is associated with it) XCS must use the full set of rules with fully specific conditions (i.e., conditions which have no #s) of the appropriate length. Note that this set includes rules with identical conditions but different actions. The 3-bit version of the fully specific rule set (with different rewards) can be found in figures 2 and 5. Note that we can represent any 3-bit Boolean function with this same rule set by introducing an appropriate reward function.

Figure 5: The 3 multiplexer with state-specific rewards which prevent generalisation.

| Input | Action | Reward |
|-------|--------|--------|
| 000   | 0      | 1000   |
| 000   | 1      | 0      |
| 001   | 0      | 1100   |
| 001   | 1      | 100    |
| 010   | 0      | 200    |
| 010   | 1      | 1200   |
| 011   | 0      | 300    |
| 011   | 1      | 1300   |
| 100   | 0      | 1400   |
| 100   | 1      | 400    |
| 101   | 0      | 500    |
| 101   | 1      | 1500   |
| 110   | 0      | 1600   |
| 110   | 1      | 600    |
| 111   | 0      | 700    |
| 111   | 1      | 1700   |

Input/output mappings and representations constrain what generalisation is
possible, and adding rewards to an input/output mapping can further constrain
what generalisations XCS can make. For example, if we choose a reward function
in which the rewards for each state are sufficiently different, as in figure 5,
XCS will be unable to generalise at all. *(What constitutes sufficiently
different rewards depends on XCS's tolerance for differences in rewards, which
in turn depends on how XCS has been parameterised. This reward function was
constructed by allocating a reward of 1000 for the correct action and 0 for the
incorrect action in the first (topmost) state and incrementing the rewards for
correct and incorrect actions by 100 for each subsequent state.)* In this case
XCS needs to use the set of all rules with fully specific conditions to
represent the function. That is, XCS will need a rule for each row in the table
in figure 5 (16 rules) rather than a rule for each row in figure 3 (8 rules).

Note that the altered reward function still returns a higher reward for the correct action in each state, and a lower reward for the incorrect action. For example, for input 000, action 0 receives more reward than action 1, and so is to be preferred by a system whose goal is to maximise the rewards it receives. This reward function is consistent with the input/output mapping we call a 3 multiplexer function. If we changed the rewards so that action 0 received less reward than action 1 for input 000 then we would no longer be dealing with a 3 multiplexer function.

Although the reward function in figure 5 is consistent with the 3 multiplexer function, from XCS's point of view the change in rewards means the problem is equivalent to the 3-bit parity problem, since with the new reward function XCS cannot generalise accurately over inputs at all. That is, even though the input/output mapping remains that of a multiplexer problem, XCS cannot generalise and must represent each input/action pair individually, as in a parity problem. Thus the representational complexity of this particular 3 multiplexer problem is equivalent to that of a parity problem (again, assuming XCS is parameterised such that it cannot generalise). This demonstrates that referring to input/output functions (e.g. multiplexer and parity functions) can be misleading when we are really referring to RL problems.

To summarise, the representational complexity of a single step RL problem depends not only on the input/output mapping but on the rewards associated with actions, the representation used, and XCS's parameterisation.

Notice how we represented the 3 multiplexer in figure 3 using the language of classifier conditions to express generalisations over inputs. That is, each line in the table can be interpreted as a classifier, and the function can be represented by a set of classifiers. We could have used other sets of rules to represent the function, e.g., the set of fully specific rules used in figures 2 and 5. The set we used was chosen because it has certain properties. It is:

- [1. Complete.] The rule set maps the entire input/action space.
- [2. Accurate.] For our purposes this means each rule maps only to a single reward.
- [3. Non-Overlapping.] No input/action pair is described by more than one rule.
- [4. Minimal.] The rule set contains no more rules than are needed to satisfy the other three properties.

A set of rules with these 4 characteristics is called an optimal
population or optimal rule set, denoted [`O`] and pronounced as the
letter O [Kovacs97a]. We'll see in section 3.2
that it is interesting to know the size of an optimal population, denoted
|[`O`]|.
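Two of these properties, completeness and non-overlap, can be checked mechanically. The following sketch (the helper names and encoding are ours) verifies them for the rule set of figure 3:

```python
from itertools import product

def expand(condition):
    """All binary inputs matched by a ternary condition."""
    options = [('0', '1') if c == '#' else (c,) for c in condition]
    return {''.join(p) for p in product(*options)}

def complete_and_non_overlapping(rules, length):
    """True iff the (condition, action) pairs cover every
    input/action pair exactly once."""
    covered = [(i, a) for cond, a in rules for i in expand(cond)]
    all_pairs = {(''.join(b), a)
                 for b in product('01', repeat=length) for a in '01'}
    return set(covered) == all_pairs and len(covered) == len(set(covered))

# The conditions of figure 3, each paired with both actions
o_mux3 = [(c, a) for c in ("00#", "01#", "1#0", "1#1") for a in "01"]
print(complete_and_non_overlapping(o_mux3, 3))  # True
```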

We find it convenient to represent test functions by their optimal populations because we can easily manipulate them to produce other optimal populations with their own corresponding functions. Working directly with the target representation makes it obvious what effects a transformation has.

XCS does not support Default Hierarchies (see, e.g., [Goldberg89c])
so we do not consider solutions involving them when calculating minimality.
Also, default hierarchies inherently violate the constraint that solutions be
composed of non-overlapping rules, so no solution containing a default hierarchy
can be an [`O`].

We can represent functions more compactly if we assume we have binary rewards,
that is, a reward function in which all correct actions receive the same reward
r_{1}, and all incorrect actions receive another reward r_{2}, where, again,
correct actions are simply those which return more reward, i.e. r_{1} > r_{2}.
This allows us to omit the reward column and the incorrect actions when
specifying a test function: if we know the correct action for a state, we know
what reward it will receive (r_{1}), and also what reward the other action in
that state will receive (r_{2}). (Alternatively, we can omit the correct actions
instead of the incorrect ones.) These conventions allow us to represent complete
RL problems in the input/output form of figures 1 and
4.

If we go further and also omit the action column we cannot specify unique
functions, but we can, using only conditions, specify classes of functions. The
advantage of this approach is slight, being only that we can omit actions in our
specification, and to obtain fully specified functions from this representation
requires some computation. To do so, we systematically assign correct actions to
the conditions in such a way as to avoid making it possible to replace
conditions with more general ones. Effectively, we require that the functions
obtained from the condition set all have the same number of rules in their
minimal representation (i.e., the same |[`O`]|). (We ignore the capacity of
non-binary reward functions to influence the minimal number of rules required,
since we continue to assume binary rewards.) Adding actions to conditions
yields fully specified input/output functions, which in turn specify full
input/action/reward mappings if we assume some binary reward function.

For example, we can assign 1s and 0s as correct actions to the conditions in
figure 6 as we like, as long as we avoid assigning the same
correct action to both 00 and 01, which would make it possible to replace them
with 0# and yield an [`O`] with only 2 rules, which consequently lies
outside the class denoted by figure 6 (because it contains 3
conditions). By assigning 1s and 0s to a set of conditions in different ways,
and avoiding the possibility of replacing conditions with more general ones, we
can obtain different functions, all of which have |[`O`]| = 4. The 4
[`O`]s which can be produced from the condition set in figure
6 are shown in figure 7.

This approach of specifying only conditions is used in figure 12. Note that we are using these more compact representations for the specification of test functions, but XCS is not using them to represent its hypotheses about the function it is learning.

Figure 6: A set of conditions specifying a class of functions.

| Condition |
|-----------|
| 00        |
| 01        |
| 1#        |

Figure 7: The four [`O`]s which can be produced from the condition set in figure 6.

| Input | Action |
|-------|--------|
| 00    | 0      |
| 01    | 1      |
| 1#    | 0      |

| Input | Action |
|-------|--------|
| 00    | 1      |
| 01    | 0      |
| 1#    | 0      |

| Input | Action |
|-------|--------|
| 00    | 0      |
| 01    | 1      |
| 1#    | 1      |

| Input | Action |
|-------|--------|
| 00    | 1      |
| 01    | 0      |
| 1#    | 1      |

So far we've discussed problem difficulty without defining its meaning.
Different measures are possible. The primary measure we'll use is [%`O`], the
proportion of the optimal population present in the classifier system on a given
time step, which is useful as a measure of the progress of genetic search. An
alternative metric, more commonly used with XCS, is that used by Wilson in,
e.g., [Wilson95a], simply called `performance'. This is defined as the
proportion of the last 50 inputs to which the system has responded correctly. It
measures the extent to which XCS has found a population of classifiers
constituting a solution to the problem. [%`O`], in contrast, measures the
extent to which XCS has found the optimal solution. The latter is
naturally more difficult to find, and requires more trials (inputs to the
system) to learn. Even after XCS has reached a point where it responds perfectly
correctly to all its inputs it still needs more time to find the optimal
solution.

We prefer the use of [%`O`] because it is more sensitive to the progress of
genetic search. As demonstrated in [Kovacs99a], [%`O`] can reveal
differences which do not show up using the other performance measure. [%`O`]
also seems a natural choice of metric given the discussion in section
3.2.

Another metric we'll use is the mean [%`O`] regret (or simply
regret) which is defined as the mean difference between 100% [`O`] and the
observed [%`O`] averaged over all trials and all runs. This corresponds to the
mean distance between the top of a [%`O`] graph and the [%`O`] curve. Perfect
performance would produce a [%`O`] regret of 0, while the worst possible
performance would produce a regret of 1.
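As a small sketch (the function name is ours), regret can be computed from %[`O`] values recorded as proportions:

```python
def mean_regret(runs):
    """Mean %O regret: the average shortfall from 100% O over all
    trials and all runs, with %O given as proportions in [0, 1]."""
    values = [p for run in runs for p in run]
    return sum(1.0 - p for p in values) / len(values)

# Two runs of two trials each: half the trials at 50% O, half at 100% O
print(mean_regret([[0.5, 1.0], [0.5, 1.0]]))  # 0.25
```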

We could consider the population size required to efficiently learn a function as another measure of its complexity. Different test functions - even those of the same string length - can have rather different population size requirements. Because of this, it is important to take population size into consideration in any comparison of different test functions. Otherwise, differences in performance may be due to the suitability of the population size used, rather than to some other feature of the problem. Different search mechanisms may have different population size requirements too, so population size should also be considered when comparing them.

Population sizing experiments on a range of 6-bit functions showed that in each case performance plateaued beyond a certain population size. We chose a population size limit of 2000 rules for all experiments, as performance on all 6-bit functions should have plateaued by this size.

The tests used in the following sections followed the standard experimental setup defined in [Wilson95a] and subsequently used in many other studies of XCS. The essence of a trial is that XCS is presented with a randomly generated binary string of a fixed length as input, it responds with either a 0 or a 1, and receives a numerical reward as specified by the reward function.

Unless otherwise stated, the settings used were the standard ones for the 6 multiplexer from [Wilson95a]. We used the specify operator from [Lanzi97a], GA subsumption deletion [Wilson98a], a niche GA in the action sets [Wilson98a], and the t3 deletion scheme from [Kovacs99a], which protects newly generated rules (with a delay of 25 for all tests). This configuration represents our current best guess at a good implementation of XCS for the problems considered here (apart from the omission of action set subsumption, which would probably generally improve performance). No attempt was made to optimise parameters for individual problems. Finally, we used uniform crossover rather than the 1 point crossover used in previous work with XCS in order to avoid any bias due to the length or position of building blocks [Goldberg89c] for the solution.

Now that we have outlined our approach, we can finally ask the question: what dimensions of single step problem difficulty are there for XCS?

Wilson briefly investigated complexity in [Wilson98a] using a series of
boolean multiplexer functions of string length 6, 11 and 20. The size of the
optimal populations |[`O`]| for these functions is 16, 32, and 64
respectively. By evaluating the number of inputs XCS needed to learn each
function, Wilson worked out an expression relating |[`O`]| and difficulty,
and another relating string length and difficulty [Wilson98a]. Not
surprisingly, functions of greater string length and |[`O`]| were more
difficult. Our experience with multiplexer-based functions of various
|[`O`]| [Kovacs96a] and multiplexer and parity functions
[Kovacs97a] indicates that larger |[`O`]| is more difficult even when
string length is constant. But, as we will soon see, string length and
|[`O`]| are not the only dimensions of problem complexity.

The lengths of the input and action strings determine the size of the space of rules XCS searches in, so it seems reasonable that difficulty would generally increase with string length. But if we consider the Kolmogorov complexity of binary strings we can see that it is quite possible for a long string to have low complexity while a much shorter string has high complexity. (See [Li97a] for an introduction to Kolmogorov complexity.) For example, it seems intuitive that a string of 10,000 zeroes has low complexity, while a binary string of 100 randomly distributed ones and zeroes has higher complexity. Technically this assumes the use of an appropriate language to represent the strings, but intuitively the longer string has much greater regularity than the shorter one. If the strings were computer files the longer string could be compressed into a smaller file than the shorter string.

Just as some strings are much more regular than others, some functions are much more regular than others (assuming a given language in both cases). If our language can capture the regularities in a function we can represent it much more compactly. We'll see in section 3.2 that how compactly we can represent a function correlates well with the difficulty XCS has in learning it, which means that string length does not have as great an effect on difficulty as we might think. That is, some functions of a given input length will be much easier than others, and will even be easier than some others of lesser input length.

It makes some intuitive sense that |[`O`]| would be a major factor in problem
complexity for XCS, because |[`O`]| is determined by how much useful
generalisation XCS can express, and XCS is largely concerned with finding
generalisations. Unlike earlier LCS, XCS was designed to learn a complete
mapping from state/action pairs to their values [Wilson95a,Kovacs2000a].
Further, it was designed with accurate generalisation over this mapping in mind,
and accurate generalisation has been a focus of XCS research from the beginning
(see
[Wilson95a,Kovacs96a,Kovacs97a,Kovacs97c,Lanzi97a,Wilson98a,Lanzi98c]).

Since an optimal population for a function is - assuming the ternary language
- a minimal representation of the function, |[`O`]| is a measure of its
complexity using this language. Thus we can think of |[`O`]| as a
representation-specific measure of Kolmogorov complexity.

Figure 8 shows the difficulty of a series of 6-bit
functions of increasing |[`O`]| (the series is defined in section
5). Difficulty increases with
|[`O`]|, but the rate of increase of difficulty slows as |[`O`]|
rises. That is, changes in |[`O`]| make less difference to difficulty at
higher values.

Why exactly does difficulty increase with |[`O`]|? One factor seems to be
that as |[`O`]| increases, the rules in the population become, on the whole,
more specific. More specific rules may be more difficult for XCS to find because
XCS updates rules and affords them opportunities to reproduce only when they
match an input. This has two consequences. First, more specific rules match
fewer inputs and so are updated less frequently, and reproduce less
frequently. The reproductive trial allocation scheme used in XCS (see
[Wilson95a]) balances out the difference between general and specific rules
to some extent, but specific rules still reproduce less than more general ones.
Now consider that a specific rule is likely to have been generated by the
reproduction of another specific rule. This means that specific rules are
generated less frequently because they reproduce less frequently.

The second reason more specific rules may be more difficult to find is that rules start with low fitness and only gain fitness as they are updated. It takes a number of updates for a rule to reach its full fitness, and more specific rules will take longer because they are updated less frequently. Because it takes such rules longer to reach their full fitness, they reproduce less than do more general rules.

A third reason is that genetic search in XCS seems to move mainly from specific to more general rules. Rules which are too general are unfit, whereas rules which are too specific are fit (although not as fit as rules which are as general as possible without being too general). So rules which are too general do not reproduce, while rules which are too specific do. If we think of a fitness landscape, more general rules have larger basins of attraction, while more specific rules have smaller ones. It should take longer for the GA to create a rule which falls in the basin of a specific rule - the GA seems inefficient at finding these rules. Lanzi noticed this problem and added a "specify" operator to XCS which detects the situation when all rules matching an input are overgeneral and creates a more specific version of a randomly chosen matching rule [Lanzi97a].

A final reason may be that as |[`O`]| increases, the proportion of inaccurate
to accurate rules increases in the space of all possible rules. To see this,
consider the extreme case of the constant function, in which all rules are
accurate. The 3-bit version of this function is shown as an [`O`] in figure
9. Its [`O`] has only 1 rule for each action, so it has the
smallest possible |[`O`]| of any function. Any other function will have some
inaccurate rules and a greater |[`O`]|.

As the proportion of accurate rules decreases, the proportion of reproductive
events which generate inaccurate rules should rise. If XCS does indeed search
mainly from accurate to more general accurate rules, the generation of new
inaccurate rules would contribute little to the search process. This would mean
that reproductive events which generate inaccurate rules contribute little to
search, so search would be slower on functions with more inaccurate rules, i.e.
those of greater |[`O`]|.

Figure 9: The 3-bit constant function as an [`O`].

| Input | Action | Reward |
|-------|--------|--------|
| ###   | 0      | 1000   |
| ###   | 1      | 0      |

As we saw in section 2.2, the reward function can constrain what generalisations are possible. Apart from this, however, we expected the form of the reward function to have little effect on problem difficulty, because in XCS fitness is based on the accuracy with which a rule predicts rewards, and not on the magnitude of the reward prediction. However, results show the range of rewards - that is, the difference between the highest and lowest reward - has a potentially strong effect on problem difficulty.

Figure 10: Rewards for a 4-bit parity function; the rewards x and y for state 1111 are varied to study the effect of reward range.

| Input | Action | Reward |
|-------|--------|--------|
| 0000  | 0      | 100    |
| 0000  | 1      | 0      |
| 0001  | 0      | 0      |
| 0001  | 1      | 100    |
| 0010  | 0      | 0      |
| 0010  | 1      | 100    |
| . . . | . . .  | . . .  |
| 0110  | 0      | 100    |
| 0110  | 1      | 0      |
| 1111  | 0      | x      |
| 1111  | 1      | y      |

To study the effect of reward range we used 4-bit parity functions in which the rewards for most of the function were binary and held constant, but the rewards for one state (1111) were varied (see figure 10). The population size limit was 500 and other settings were as defined in section 2.5 and used for the 6-bit tests shown in figure 8.

As the range in rewards increases, problem difficulty initially decreases, but then increases (see figure 11). To explain the means by which the reward range has its effects we must refer to the XCS update rules [Wilson95a], in particular the prediction error update:

`epsilon`_{j} <- `epsilon`_{j} + `beta`(|R - p_{j}| / (R_{max} - R_{min}) - `epsilon`_{j})

and the rule accuracy update:

`kappa`_{j} = 1   if `epsilon`_{j} <= `epsilon`_{o}

`kappa`_{j} = 0.1 e^{(ln `alpha`)(`epsilon`_{j} - `epsilon`_{o}) / `epsilon`_{o}}   otherwise

where `epsilon`_{j} is the prediction error of rule j, 0 < `beta` <= 1 is
the learning rate, R is the reward, p_{j} is the prediction of rule j,
R_{max} and R_{min} are the highest and lowest rewards possible in any
state, `kappa`_{j} is the accuracy of rule j, `epsilon`_{o} is a constant
controlling the tolerance for prediction error and 0 < `alpha` < 1 is a constant
controlling the rate of decline in accuracy when the threshold `epsilon`_{o}
is exceeded (see [Wilson95a]).

*Wilson has since changed the accuracy update to:*

`kappa`_{j} = 1   if `epsilon`_{j} < `epsilon`_{o}

`kappa`_{j} = `alpha`(`epsilon`_{j} / `epsilon`_{o})^{-v}   otherwise

*where 0 < v is another constant controlling the rate of decline in accuracy
when `epsilon`_{o} is exceeded.*

We can see that in the prediction error update the error between prediction
and reward, |R - p_j|, is normalised to fall between 0 and 1 by dividing it by
the range in rewards, R_max - R_min. This means the normalised error of a rule
is inversely proportional to the reward range: the larger the range, the
smaller the error.

The accuracy update says something like: the larger the error the lower the accuracy. Since fitness is based on accuracy, larger errors mean lower fitness. Putting all this together, larger reward ranges mean the fitness calculation is less sensitive to prediction error - it takes a larger error to produce the same effect. This means that, with larger reward ranges, XCS will often have more difficulty distinguishing accurate from inaccurate rules. We attribute the increase in difficulty as the reward range grows to this effect.
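To make this interaction concrete, here is a minimal sketch of the two updates above. The parameter values for beta, epsilon_0 and alpha are illustrative assumptions, not the settings used in our experiments. With the same absolute error |R - p| = 50, a tenfold larger reward range makes the normalised error an order of magnitude smaller:

```python
# Sketch of the XCS prediction error and accuracy updates described above
# (per [Wilson95a]); parameter values here are illustrative assumptions.
import math

BETA = 0.2        # learning rate, 0 < beta <= 1
EPSILON_0 = 0.01  # tolerance for prediction error
ALPHA = 0.1       # rate of decline in accuracy past the tolerance

def update_error(eps_j, reward, prediction, r_max, r_min):
    """One prediction-error update, normalised by the reward range."""
    target = abs(reward - prediction) / (r_max - r_min)
    return eps_j + BETA * (target - eps_j)

def accuracy(eps_j):
    """Original accuracy update: 1 within tolerance, else exponential decay."""
    if eps_j <= EPSILON_0:
        return 1.0
    return 0.1 * math.exp(math.log(ALPHA) * (eps_j - EPSILON_0) / EPSILON_0)

# Same absolute error |R - p| = 50 under two reward ranges:
e_narrow = update_error(0.0, 100, 50, 100, 0)     # range 100 -> error 0.1
e_wide   = update_error(0.0, 1000, 950, 1000, 0)  # range 1000 -> error ~0.01

# e_wide sits at the tolerance epsilon_0, so the rule still looks accurate,
# while e_narrow is well past it and the rule's accuracy collapses.
```

This is the sense in which a larger reward range makes the fitness calculation less sensitive to a given absolute prediction error.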

However, with larger reward ranges XCS will not always have more
difficulty distinguishing accurate from inaccurate rules. In the extreme case a
rule may be updated towards both R_{max} and R_{min}, and will have a large
error. The errors of other rules will, in comparison, be small. This may allow
XCS to more easily distinguish the overgenerality of some rules, and may account
for the initial decrease of problem difficulty as reward range increases. We
hypothesise that as reward range increases further this effect is then swamped
by the more significant effect (that XCS becomes less sensitive to prediction
error) and problem difficulty increases.

In order to avoid confounding effects of rewards, in this investigation we use only binary reward functions with the same reward range unless otherwise noted.

The Hamming distance between two strings is defined as the number of characters
which must be changed to transform one into the other. The mean Hamming distance
(MHD) of an [`O`] is the mean distance over all pairs of condition/action
strings in the population, including the comparison of each string with itself.

To study the effect of MHD on difficulty we used the 4 functions represented in
figure 12. This figure shows 4 sets of conditions, each of
which can be transformed into a fully specified [`O`] by assigning alternating 0s
and 1s as the correct action for each condition, commencing with 0 as the
correct action for the topmost condition and working downwards. Correct actions
received a reward of 100 while incorrect actions received a reward of 0.

The 4 [`O`]s represented in figure 12 each have |[`O`]| = 8
and string length 6, but each has a different mean hamming distance.
Figure 13 shows that difficulty increases with MHD on
this set of functions.

Figure 12: Four condition sets for 6-bit functions, each defining an [`O`] with |[`O`]| = 8, and their mean Hamming distances (MHD).

|     | H1      | H2      | H3      | H4      |
|-----|---------|---------|---------|---------|
|     | 000 ### | 00# 0## | 00# 0## | 000 ### |
|     | 001 ### | 00# 1## | 00# 1## | 001 ### |
|     | 010 ### | 01# 0## | 01# 0## | 01# 0## |
|     | 011 ### | 01# 1## | 01# 1## | 01# 1## |
|     | 100 ### | 10# #0# | 10# #0# | 10# #0# |
|     | 101 ### | 10# #1# | 10# #1# | 10# #1# |
|     | 110 ### | 11# #0# | 11# ##0 | 11# ##0 |
|     | 111 ### | 11# #1# | 11# ##1 | 11# ##1 |
| MHD | 2       | 2.75    | 2.9375  | 3.125   |
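The MHD values in figure 12 can be checked directly from the definition. The following sketch (our own code; the names are illustrative) appends each condition's alternating correct action and averages the pairwise distances, self-pairs included:

```python
# Sketch: mean Hamming distance (MHD) of an optimal population [O], computed
# over ALL ordered pairs of condition/action strings, including self-pairs.
def mhd(population):
    """population: list of equal-length condition+action strings."""
    n = len(population)
    total = sum(
        sum(a != b for a, b in zip(s, t))  # Hamming distance between s and t
        for s in population
        for t in population
    )
    return total / (n * n)

# The four condition sets of figure 12:
H1 = ["000###", "001###", "010###", "011###", "100###", "101###", "110###", "111###"]
H2 = ["00#0##", "00#1##", "01#0##", "01#1##", "10##0#", "10##1#", "11##0#", "11##1#"]
H3 = ["00#0##", "00#1##", "01#0##", "01#1##", "10##0#", "10##1#", "11###0", "11###1"]
H4 = ["000###", "001###", "01#0##", "01#1##", "10##0#", "10##1#", "11###0", "11###1"]

def with_actions(conditions):
    """Append the alternating correct action (0, 1, 0, 1, ...) to each condition."""
    return [c + str(i % 2) for i, c in enumerate(conditions)]

mhds = [mhd(with_actions(h)) for h in (H1, H2, H3, H4)]  # 2.0, 2.75, 2.9375, 3.125
```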

We hypothesise that this effect is partly due to the greater ease of
transforming more similar strings into each other by crossover and mutation:
in [`O`]s with smaller mean Hamming distances it is easier to move from one
accurate general rule to another.

An additional factor may be involved. In section 3 we hypothesised that inaccurate rules contributed little to genetic search. If this is the case, mutation of an accurate rule into an inaccurate rule is a waste of effort, and slows the rate of genetic search. Rules which are more similar to other accurate rules are more likely to mutate into them, and less likely to mutate into inaccurate rules. Even if the accurate rule already exists, there may be more benefit in creating another copy of it than in creating an inaccurate rule. Optimal populations with smaller hamming distances between their rules are less likely to waste their efforts by producing inaccurate rules.

In the preceding sections we've identified several dimensions of single step problem complexity for XCS. In this section we consider some characteristics of the space of single step functions and how these dimensions structure it.

The space of single step functions grows rapidly with string length - there are
2^{2^l} possible binary functions for a binary string of length l. We know
that some of these functions are more difficult for XCS than others. One
dimension which affects difficulty is |[`O`]| - results from section
3.2 show that difficulty increases with |[`O`]|. We
can use |[`O`]| to structure the space of l-bit functions: at one extreme,
with |[`O`]| maximised for a given l, we have a parity function. In this
case [`O`] consists of 2^{l} * 2 fully specific rules. (There are 2^{l} inputs
to match, and each maps to 2 actions.) At the other extreme, with |[`O`]|
minimised, we have a constant function. In this case [`O`] consists of 2 rules
with fully generalised conditions (since the fully generalised condition maps to
2 actions).

Although the parity and constant functions bound |[`O`]| for a given string
length, they are atypical: of the 2^{2^l} functions for a given l there are
only 2 parity and 2 constant functions. The vast majority of functions have
|[`O`]| somewhere between the two.
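For small l we can locate a function between these extremes by brute force. The sketch below is our own construction: with a binary reward scheme a rule (condition, action) is accurate exactly when the function is constant on the condition's match set, so we enumerate the accurate, maximally general conditions; assuming these do not overlap redundantly, each pairs with both actions to give |[`O`]|:

```python
# Brute-force sketch (our own): count the accurate, maximally general ternary
# conditions of a boolean function f over l bits. Twice this count gives
# |[O]| when the maximal conditions carry no redundant overlap.
from itertools import product

def matches(cond, inp):
    return all(c == "#" or c == b for c, b in zip(cond, inp))

def is_accurate(cond, f, l):
    # Accurate iff the reward, i.e. whether f(x) == action, is constant on
    # the match set - equivalently, iff f is constant there.
    outputs = {f(inp) for inp in product("01", repeat=l) if matches(cond, inp)}
    return len(outputs) == 1

def more_general(c1, c2):
    """True if c1 strictly generalises c2 (matches a superset of inputs)."""
    return c1 != c2 and all(a == "#" or a == b for a, b in zip(c1, c2))

def maximal_accurate_conditions(f, l):
    accurate = [c for c in product("01#", repeat=l) if is_accurate(c, f, l)]
    return [c for c in accurate
            if not any(more_general(d, c) for d in accurate)]

parity = lambda inp: inp.count("1") % 2
constant = lambda inp: 0

size_parity = 2 * len(maximal_accurate_conditions(parity, 3))      # 16 rules
size_constant = 2 * len(maximal_accurate_conditions(constant, 3))  # 2 rules
```

For 3-bit strings this recovers the bounds above: 2^3 * 2 = 16 fully specific rules for parity, and 2 fully general rules for the constant function.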

If |[`O`]| were the only dimension relevant to difficulty we would be
justified in stating that the difficulty of a function d([O_{1}]) is
greater than that of another d([O_{2}]) if its |[`O`]| is greater. That is, if
|[O_{1}]| > |[O_{2}]| then d([O_{1}]) > d([O_{2}]). This
would mean that, for l-bit functions, parity was the hardest and the constant
function the easiest. This would give us bounding cases on complexity for
l-bit functions and a unique ordering among them (by |[`O`]|). Further, it
would give us only 2^{l} * 2 complexity classes (sets of functions of
equivalent difficulty) in a much larger space of 2^{2^l} functions. That is,
if we wanted to test XCS on all different levels of difficulty for l-bit
functions, we would only have to test 2^{l} * 2 rather than 2^{2^l}
functions.

However, we know that |[`O`]| is not the only dimension of problem
difficulty. Let's consider the others we've identified. Mean Hamming distance,
for a given string length l, covaries with |[`O`]|: the constant and parity
functions - the bounding cases for |[`O`]| of a given l - have fixed mean
Hamming distances. MHD is only variable away from these extremes, so we need
only consider its effect away from them. For example, we need not worry about
whether MHD can make one parity function more difficult than another, since
they must have the same MHD. This suggests that, unless MHD has a very strong
effect - and our studies suggest it does not - complexity for l-bit
functions is indeed bounded by the constant and parity functions. This issue
deserves further study.

Unlike MHD, the reward range is independent of |[`O`]|: we can use any
rewards we like with any function. This suggests that in comparing the
complexity of functions, we should hold reward range constant, unless reward
range is itself the object of comparison, in which case we should hold all other
dimensions constant. This was the approach taken in section
3.3.

The above suggests that, for a given string length and reward range, |[`O`]|
may be a reasonable metric for problem difficulty, and that the idea of dividing
the space of functions into 2^{l} complexity classes defined by |[`O`]| is
also reasonable.

It is unfortunate that we have been unable to devise a more theoretically
satisfying model of complexity than the "|[`O`]| + noise" model proposed
above. However, it is perhaps not surprising given the complexity of XCS: the
classifier update rules, genetic algorithm, generalisation mechanism, deletion
scheme, triggered operators and other mechanisms all affect the system's
performance. While no simple precise model of all the above has been found, we
are pleased that a single dimension, |[`O`]|, provides a simple and seemingly
reasonable metric for problem difficulty. A somewhat more precise metric could
perhaps be devised by combining |[`O`]| and MHD, but we will not consider this
here.

What other dimensions of single step problem difficulty exist for XCS, and what
their significance is, remains to be seen. Because of this, it also remains to
be seen whether |[`O`]| is sufficient as a complexity metric.

In section 3.2 we noted that generalisation is an
important subject for XCS, and that |[`O`]| is a measure of the degree to
which generalisation is possible using the ternary language. We also saw that
|[`O`]| has a major effect on problem difficulty. In section
4 we saw how |[`O`]| can be used to
structure the space of functions of a given string length, and how using
|[`O`]| as a complexity metric divides the space of functions into a set of
complexity classes. There are many fewer complexity classes than functions,
which means we have to test XCS on only a small fraction of the function space
to evaluate it at all levels of difficulty. However, there seems little need to
go into such detail, particularly since higher values of |[`O`]| make
increasingly fine distinctions about difficulty.

Based on these observations, we propose a single step test suite which ranges
over the dimension of |[`O`]|, and - to the extent that |[`O`]| captures
problem difficulty - ranges over problem difficulty. The suite is generated for
l-bit strings as follows:

- [1] The first function in the series is the parity function for strings of
length l. This function allows no useful generalisation.
- [2] Obtain the next function by making one of the l bits in the string
irrelevant to the string's value. In effect we have a parity function for a
string of l-1 bits computed from a string of l bits. This function allows
XCS to generalise over the irrelevant bit.
- [3] Repeat step 2 to cumulatively make more bits irrelevant and to obtain more functions until we reach the constant function, in which all bits are irrelevant.
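The steps above can be sketched as follows (our own code; here we make the leftmost relevant bit irrelevant at each step, so each function computes the parity of its rightmost relevant bits):

```python
# Sketch of the test suite generator: l+1 functions, from full l-bit parity
# down to the constant function, making one more (leftmost) bit irrelevant
# at each iteration of step 2.
def make_suite(l):
    """Return l+1 functions, each mapping an l-bit tuple to the correct action."""
    suite = []
    for ignored in range(l + 1):
        # Parity over the rightmost (l - ignored) relevant bits; when every
        # bit is ignored this degenerates to the constant function 0.
        suite.append(lambda bits, k=ignored: sum(bits[k:]) % 2)
    return suite

suite = make_suite(4)
# suite[0] is 4-bit parity; suite[1] ignores the leftmost bit;
# suite[4] is the constant function.
```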

This algorithm yields a set of l+1 functions for l-bit strings. Recall that
the number of functions grows hyperexponentially with the input string length,
and that the number of complexity classes defined by |[`O`]| grows
exponentially with it. Using this test suite, however, the number of tests we
have to make grows only linearly with the input string length. In other words,
the suite scales well to longer strings in terms of the effort required to
perform the tests.

Note that this test suite is specific to the ternary LCS language, and not to XCS. That is, it may be used with any LCS, or indeed any system employing the ternary LCS language.

A disadvantage of the test suite is that it considers |[`O`]| as the only
dimension of problem difficulty. We would argue that the reward range can be
considered separately to |[`O`]| - we can use any reward range with the test
suite. We would also argue that the bounds on |[`O`]| provide bounds on MHD,
since there is no variation in MHD at the bounds of |[`O`]|. There is the
possibility that other, as yet unknown, dimensions of single step problem
difficulty exist for XCS. Note, however, that the algorithm for generating the
test suite does not specify how to select bits to ignore. By selecting bits in
different orders we end up with different versions of the test suite. To cater
for the possibility of unknown dimensions of problem complexity we could iterate
the suite generation algorithm many times to produce many suites and average the
results obtained from using them to test XCS.

The 6-bit tests shown in figure 8 were generated using this algorithm, with the leftmost relevant bit becoming irrelevant on each iteration of step 2.

We began with some methodological considerations, arguing that our approach of studying single step tasks is reasonable even if we're really interested in sequential ones. We then distinguished between the input/output functions we often speak of and the RL problems XCS is really applied to. Next we presented a way of representing RL problems which is particularly well suited to systems which use the ternary LCS language. Then we saw, for the first time in the literature, how population size affects performance in XCS and took measures to take its effect into account.

We've also taken some steps towards answering the questions posed at the start
of the paper. We've examined a number of dimensions of problem complexity, some
of them (reward range and MHD) previously unknown. We've illustrated how a
significant dimension, |[`O`]|, structures the space of functions and defines
complexity classes within it. Based on this we've presented a single step test
suite template that's simple to describe and implement, and which scales to any
input string length. We hope this test suite will prove useful, both directly,
through its use in evaluating LCS, and by spurring the search for a better
suite and the knowledge needed to construct one.

The work begun here can be extended in many ways. To begin with, the search for additional dimensions of complexity for XCS seems important, as does evaluation of the many hypotheses introduced to account for the effects observed in section 3. To what extent our approach is appropriate for other LCS remains to be seen, as does their sensitivity to the complexity dimensions we've examined with XCS.

Finally, we've provided a great deal of additional empirical evidence to support
the suggestion in [Kovacs97a] that XCS reliably evolves [`O`]s for boolean
functions. We suspect that XCS can reliably learn any function from the
class studied here, given enough resources.

We'd like to thank the two anonymous reviewers for their helpful comments and Stewart Wilson for his support and inspiration over several years.

- [Goldberg89c]
Goldberg, D. E. Genetic Algorithms in Search, Optimization, and Machine
Learning. Addison-Wesley, 1989.
- [Kovacs96a]
Kovacs, T.
Evolving Optimal Populations with XCS Classifier Systems.
MSc Thesis, University of Birmingham. Also
Technical Report CSR-96-17 and CSRP-96-17, School of Computer
Science, University of Birmingham, Birmingham, U.K., 1996.
- [Kovacs97c]
Kovacs, T.
Steady State Deletion Techniques in a Classifier System. Unpublished
PhD report, 1997.
- [Kovacs97a]
Kovacs, T.
XCS Classifier System Reliably Evolves Accurate, Complete, and
Minimal Representations for Boolean Functions.
In Roy, Chawdhry, and Pant, editors, Soft Computing in
Engineering Design and Manufacturing, pages 59-68. Springer-Verlag, 1997.
- [Kovacs99a]
Kovacs, T.
Deletion schemes for classifier systems.
In W. Banzhaf, J. Daida, A. E. Eiben, M. H. Garzon, V. Honavar,
M. Jakiela, and R. E. Smith, editors, GECCO-99: Proceedings of the
Genetic and Evolutionary Computation Conference, pages 329-336. Morgan
Kaufmann, 1999.
- [Kovacs2000a]
Kovacs, T.
Strength or Accuracy? Fitness Calculation in Learning Classifier
Systems.
In P. L. Lanzi, W. Stolzmann, and S. W. Wilson,
editors, Learning Classifier Systems: An Introduction to Contemporary
Research, pages 143-160. Springer-Verlag, 2000.
- [Lanzi97a]
Lanzi, P. L.
A Study of the Generalization Capabilities of XCS. In Thomas Bäck, editor, Proceedings Seventh International
Conference on Genetic Algorithms (ICGA-7), pages 418-425. Morgan
Kaufmann, 1997.
- [Lanzi98c]
Lanzi, P. L.
Generalization in Wilson's XCS.
In A. E. Eiben, T. Bäck, M. Shoenauer, and H.-P. Schwefel, editors,
Proceedings of the Fifth International Conference on Parallel Problem
Solving From Nature, number 1498 in LNCS. Springer-Verlag, 1998.
- [Li97a]
Li, M. and Vitányi, P.
An Introduction to Kolmogorov Complexity and Its
Applications. 2nd edition. Springer-Verlag, 1997.
- [Wilson95a]
Wilson, S. W.
Classifier fitness based on accuracy.
Evolutionary Computation, 3(2):149-175, 1995.
- [Wilson98a]
Wilson, S. W.
Generalization in the XCS classifier system.
In J. Koza et al., editors, Genetic Programming 1998:
Proceedings of the Third Annual Conference, pages 665-674. Morgan Kaufmann, 1998.

© 2001 Springer, LNAI 1996. Tim Kovacs and Manfred Kerber
