What Makes a Problem Hard for XCS?

Tim Kovacs

Manfred Kerber

School of Computer Science
The University of Birmingham
Birmingham B15 2TT England
http://www.cs.bham.ac.uk/~mmk

ABSTRACT

Despite two decades of work, learning classifier systems researchers have had relatively little to say on the subject of what makes a problem difficult for a classifier system. Wilson's accuracy-based XCS, a promising and increasingly popular classifier system, is, we feel, the natural first choice of classifier system with which to address this issue. To make the task more tractable we limit our considerations to a restricted, but very important, class of problems. Most significantly, we consider only single step reinforcement learning problems and the use of the standard binary/ternary classifier systems language. In addition to distinguishing several dimensions of problem complexity for XCS, we consider their interactions, identify bounding cases of difficulty, and consider complexity metrics for XCS. Based on these results we suggest a simple template for ternary single step test suites to more comprehensively evaluate classifier systems.

1.  Introduction

Two basic questions to ask about any learning system are: to what kinds of problems is it well suited? To what kinds of problems is it poorly suited? Despite two decades of work, Learning Classifier Systems (LCS) researchers have had relatively little to say on the subject. Although this may in part be due to the wide range of systems and problems the LCS paradigm encompasses, it is certainly a reflection of deficiency in LCS theory.

Delving into the subject more deeply, a host of other questions arise. What is it about a problem that makes it difficult or easy for a given system? In other words, what factors are involved in determining a problem's difficulty? How do these factors interact? What effects do the representation(s) used by the system have? What effects does the rule discovery system have? In short, we want to know about the dimensions of problem complexity for the system of interest. Clearly this is a difficult subject even for one particular learning system. At present we consider only Wilson's XCS [Wilson95a] classifier system, although we hope much of the approach and some of the results will transfer to other LCS. To simplify matters, we restrict consideration to the standard ternary LCS language, to a binary action space and to single step reinforcement learning problems.

One reason to ask questions like those above is, of course, to find out when we can hope to successfully apply a given system to a given problem. Another reason is to better understand how to evaluate a system. Since we can't test our systems on all possible problems, we must consider only a subset. How can we choose this subset? There are at least two conflicting criteria. First, the subset should maximise the coverage of the dimensions of difficulty in order to more fully represent the space of all possible problems. That is, it should include as many as possible of the features which make a problem difficult for the system in question. Otherwise the test set may not detect deficiencies in an algorithm which would become apparent on other problems. Second, we would like to minimise the number of tests which must be made in order to make testing more manageable. Optimising these two criteria requires a good understanding of the dimensions of problem complexity for the system in question.

We address questions of problem complexity by considering the space of all possible test functions (i.e. learning problems) given our various restrictions. Based on our insights into this space we suggest a simple ternary single step test suite for LCS, and provide some results for XCS on it. To begin with, however, we consider how to approach the study of problem complexity in XCS.

2.  Methodological Considerations

In this section we briefly motivate our study, consider some representational issues, and outline our approach.

2.1.  Why Study Single Step Tests?

Single step functions are those in which the LCS's actions have no influence on which inputs it receives in the future. This contrasts with sequential (also called multi step) functions, in which actions do influence future inputs. (We'll refer to states and inputs interchangeably.) In previous work we have mainly studied single step functions and we continue that practice here for two reasons. First, some applications, e.g. data mining, are single step, so it is of interest to understand and to optimise LCS for single step problems. Second, single step functions avoid many complications of sequential ones. Even if we're interested only in sequential problems, it seems reasonable to first evaluate, understand, and possibly improve our systems in the simpler single step case. When we have a good understanding of the basic LCS mechanisms we can go on to look at issues which are specific to the sequential case. This seems easier than starting with the more complex case and having to face a host of additional problems at the outset. We would argue that present understanding of LCS is limited enough to justify this approach. (Clearly, however, useful work is being done with sequential tests, not all of it addressing problems exclusive to them.)

Of course we have to be careful when evaluating potential improvements to ensure that we're not overfitting: optimising performance on single step problems at the expense of sequential ones. If we are interested in sequential problems we need sequential tests in our test suite.

2.2.  Representing and Manipulating Functions

In Reinforcement Learning (RL), feedback to the learner about the value of its actions consists only of numeric values, called rewards, and the goal of any reinforcement learner is to maximise (some function of) the rewards it receives. Rewards are defined a priori by the experimenter in the form of a reward function, which (for our purposes) maps input/action pairs to integers. The reward function is an essential component of any RL problem specification; changing the reward function changes the problem.

As an aside, XCS - like most other reinforcement learners, but unlike traditional LCS - learns an approximation of the entire reward function. That is, every input/action pair is represented. Traditional LCS, in contrast, are only concerned with representing the more rewarding parts of this space [Wilson95a,Kovacs2000a].

Specification of SL and RL Problems.

Working with LCS, we often refer to test problems in terms of an input/output mapping. One example is the 6 multiplexer, which has often been used as an LCS test. Figure 1 shows the 3 multiplexer, a related but simpler function. This exhaustive listing of input/output cases is called a truth table.

The 3 multiplexer is defined on binary strings of length 3, and treats the string as being composed of an index segment (the first bit) and a data segment (the remaining two bits). The value of the function is the value of the indexed bit in the data segment, so, for example, the value of 000 is 0, the value of 001 is 0, the value of 101 is 1 and so on. Knowing the value of the string, we can do Supervised Learning (SL); that is, we know which action the LCS must respond with in order to be correct.
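As a concrete sketch (the function name and string encoding are ours, not part of any original experimental code), the 3 multiplexer can be written in a few lines of Python:

    def multiplexer3(s):
        # s is a 3-bit binary string: 1 index bit followed by 2 data bits.
        index = int(s[0])           # which data bit is selected
        return int(s[1 + index])    # the value of the indexed data bit

    # multiplexer3('001') == 0 and multiplexer3('101') == 1, as in figure 1.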

However, the input/output mapping of figure 1 alone is not a complete specification of an RL problem, since it does not specify rewards. In adapting a boolean function to an RL paradigm we need to extend it by defining a reward function. So the 3 multiplexer is not a complete RL problem - we need to extend it with a reward function, which we have done in figure 2. (Horizontal lines have been inserted between different inputs in this and other long figures simply as a visual aid.) Note that this figure refers to actions rather than the output of the function, because we are now dealing with a learning agent which acts (predicts the output of the function). Note also that we specify the reward for both the correct and incorrect action for each input, since we must specify a reward for each possible input/action pair. By "correct" we mean the action which receives the higher reward for that input.

  Input   Output
  000   0
  001   0
  010   1
  011   1
  100   0
  101   1
  110   0
  111   1
Figure 1. The 3 multiplexer function.

  Input   Action   Reward
  000   0   1000
  000   1   0
  001   0   1000
  001   1   0
  010   0   0
  010   1   1000
  011   0   0
  011   1   1000
  100   0   1000
  100   1   0
  101   0   0
  101   1   1000
  110   0   1000
  110   1   0
  111   0   0
  111   1   1000
Figure 2. The 3 multiplexer, with one possible reward function.

In RL Rewards Determine Input/Output Mappings.

We associated certain rewards with input/action pairs in figure 2 but clearly could have used other values. Other rewards will produce a 3 multiplexer problem as long as the correct action in each state is the output specified for that state in figure 1. If this is not the case, we no longer have a 3 multiplexer problem since it is the rewards which determine what input/output mapping will be learnt.

Even when rewards are consistent with the 3 multiplexer input/output mapping, 3 multiplexer RL problems differ when their reward functions differ. We'll see an example of this shortly.

  Input   Action   Reward
  00#   0   1000
  00#   1   0
  01#   0   0
  01#   1   1000
  1#0   0   1000
  1#0   1   0
  1#1   0   0
  1#1   1   1000
Figure 3. The 3 multiplexer, with rewards and generalisations expressed with the ternary syntax.

  Input   Output
  000   1
  001   0
  010   0
  011   1
  100   0
  101   1
  110   1
  111   0
Figure 4. The even 3 parity function.

Representing Generalisations.

In the original and standard representation used with XCS [Wilson95a], each rule has a single fixed-length l-bit condition, which is a string from {0,1,#}^l, and a single action, which is a string from {0,1}^a. In this work a=1. A condition c matches a binary input string i if the characters in each position in c and i are the same, or if the character in c is a #, so the # is the means by which conditions generalise over inputs. For example, the condition 00# matches two inputs: 000 and 001. Actions do not contain #s and so, using this representation, XCS cannot generalise over actions.
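A minimal sketch of this matching rule in Python (the function name is ours) makes the role of the # explicit:

    def matches(condition, inp):
        # A ternary condition matches a binary input when every position
        # is either identical or a '#' (don't care) in the condition.
        return all(c == b or c == '#' for c, b in zip(condition, inp))

    # matches('00#', '000') and matches('00#', '001') are True;
    # matches('00#', '010') is False.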

The input/action/reward structure of the 3 multiplexer in figure 2 admits a number of accurate generalisations over inputs, and XCS seeks to find them. Figure 3 shows how XCS will learn to represent the problem defined in figure 2 using ternary classifier conditions to express generalisations over inputs. Using generalisation allows us (and XCS) to represent functions with fewer rules.

Notice that a table like figure 3 can be used in two ways: as a specification of the function to be learned by XCS, and by XCS to represent its hypotheses about the function it is learning.

Input/Output Functions Constrain Generalisation.

The amount of generalisation possible depends on the input/output mapping and the representation used. Consider the even 3 parity function, whose output is 1 when there are an even number of 1s in the input, and 0 otherwise (figure 4). While the 3 multiplexer admits considerable generalisation using the ternary LCS language, a parity function admits none whatsoever. That is, any condition which contains 1 or more #s is overgeneral.
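A sketch of the function in Python (our encoding):

    def even3parity(s):
        # Output 1 when the input contains an even number of 1s, else 0.
        return int(s.count('1') % 2 == 0)

    # Flipping any single matched bit flips the output, which is why a
    # condition containing any #s matches inputs of both classes and is
    # therefore overgeneral.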

Consequently, to represent the even 3 parity function (regardless of what reward function is associated with it) XCS must use the full set of rules with fully specific conditions (i.e., conditions which have no #s) of the appropriate length. Note that this set includes rules with identical conditions but different actions. The 3-bit version of the fully specific rule set (with different rewards) can be found in figures 2 and 5. Note that we can represent any 3-bit Boolean function with this same rule set by introducing an appropriate reward function.

  Input   Action   Reward
  000   0   1000
  000   1   0
  001   0   1100
  001   1   100
  010   0   200
  010   1   1200
  011   0   300
  011   1   1300
  100   0   1400
  100   1   400
  101   0   500
  101   1   1500
  110   0   1600
  110   1   600
  111   0   700
  111   1   1700
Figure 5. The 3 multiplexer with rewards which may cause XCS to learn to represent it without generalisation.

Rewards Can Constrain Generalisation.

Input/output mappings and representations constrain what generalisation is possible, and adding rewards to an input/output mapping can further constrain what generalisations XCS can make. For example, if we choose a reward function in which the rewards for each state are sufficiently different, as in figure 5, XCS will be unable to generalise at all. (What constitutes sufficiently different rewards depends on XCS's tolerance for differences in rewards, which in turn depends on how XCS has been parameterised. The reward function in figure 5 was constructed by allocating a reward of 1000 for the correct action and 0 for the incorrect action in the first (topmost) state, and incrementing the rewards for correct and incorrect actions by 100 for each subsequent state.) In this case XCS needs to use the set of all rules with fully specific conditions to represent the function. That is, XCS will need a rule for each row in the table in figure 5 (16 rules) rather than a rule for each row in figure 3 (8 rules).

Note that the altered reward function still returns a higher reward for the correct action in each state, and a lower reward for the incorrect action. For example, for input 000, action 0 receives more reward than action 1, and so is to be preferred by a system whose goal is to maximise the rewards it receives. This reward function is consistent with the input/output mapping we call a 3 multiplexer function. If we changed the rewards so that action 0 received less reward than action 1 for input 000 then we would no longer be dealing with a 3 multiplexer function.
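This consistency requirement is easy to check mechanically. A sketch in Python, assuming the two mappings are given as dictionaries (the representation and names are ours):

    def consistent(rewards, io_map):
        # rewards maps (input, action) pairs to reward values; io_map maps
        # each input string to its correct output (0 or 1). Consistency
        # requires the correct action to receive strictly more reward in
        # every state.
        return all(rewards[(s, out)] > rewards[(s, 1 - out)]
                   for s, out in io_map.items())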

Although the reward function in figure 5 is consistent with the 3 multiplexer function, from XCS's point of view the change in rewards makes the problem equivalent to the 3-bit parity problem, since with the new reward function XCS cannot generalise accurately over inputs at all. That is, even though the input/output mapping remains that of a multiplexer problem, XCS cannot generalise and must represent each input/action pair individually, as in a parity problem. Thus the representational complexity of this particular 3 multiplexer problem is equivalent to that of a parity problem (again, assuming XCS is parameterised such that it cannot generalise). This demonstrates that referring to input/output functions (e.g. multiplexer and parity functions) can be misleading when we are really referring to RL problems.

To summarise, the representational complexity of a single step RL problem depends not only on the input/output mapping but on the rewards associated with actions, the representation used, and XCS's parameterisation.

Optimal Rule Sets.

Notice how we represented the 3 multiplexer in figure 3 using the language of classifier conditions to express generalisations over inputs. That is, each line in the table can be interpreted as a classifier, and the function can be represented by a set of classifiers. We could have used other sets of rules to represent the function, e.g., the set of fully specific rules used in figures 2 and 5. The set we used was chosen because it has certain properties. It is:

1. Complete: it maps the entire input/action space.
2. Accurate: each rule predicts the reward for every input/action pair it matches without error.
3. Non-overlapping: no input/action pair is matched by more than one rule.
4. Minimal: no rule set with the first three properties has fewer rules.

A set of rules with these 4 characteristics is called an optimal population or optimal rule set, denoted [O] and pronounced as the letter O [Kovacs97a]. We'll see in section 3.2 that it is interesting to know the size of an optimal population, denoted |[O]|.

We find it convenient to represent test functions by their optimal populations because we can easily manipulate them to produce other optimal populations with their own corresponding functions. Working directly with the target representation makes it obvious what effects a transformation has.

Minimality and Default Hierarchies.

XCS does not support Default Hierarchies (see, e.g., [Goldberg89c]) so we do not consider solutions involving them when calculating minimality. Also, default hierarchies inherently violate the constraint that solutions be composed of non-overlapping rules, so no solution containing a default hierarchy can be an [O].

More Compact Representations.

We can represent functions more compactly if we assume we have binary rewards, that is, a reward function in which all correct actions receive the same reward r1, and all incorrect actions receive another reward r2, where, again, correct actions are simply those which return more reward, i.e. r1 > r2. This allows us to omit the reward column and the incorrect actions when specifying a test function: if we know the correct action for a state, we know what reward it will receive (r1), and also what reward the other action in that state will receive (r2). (Alternatively, we can omit the correct actions instead of the incorrect ones.) These conventions allow us to represent complete RL problems in the input/output form of figures 1 and 4.
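Under this convention a complete RL problem can be recovered mechanically from an input/output table. A sketch in Python (the function name is ours; the default rewards echo figure 2 but are otherwise arbitrary):

    def expand(io_map, r1=1000, r2=0):
        # Rebuild the full input/action/reward table from an input/output
        # mapping: the correct action gets r1, the other action gets r2.
        return {(s, a): (r1 if a == out else r2)
                for s, out in io_map.items()
                for a in (0, 1)}

    # expand({'000': 0, '001': 0, '010': 1, ...}) reproduces figure 2.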

If we go further and also omit the action column we cannot specify unique functions, but we can, using only conditions, specify classes of functions. The advantage of this approach is slight - only that we can omit actions in our specification - and obtaining fully specified functions from this representation requires some computation. To do so, we systematically assign correct actions to the conditions in such a way as to avoid making it possible to replace conditions with more general ones. Effectively, we require that the functions obtained from the condition set all have the same number of rules in their minimal representation (i.e., the same |[O]|). (We ignore the capacity of non-binary reward functions to influence the minimal number of rules required, since we continue to assume binary rewards.) Adding actions to conditions yields fully specified input/output functions, which in turn specify full input/action/reward mappings if we assume some binary reward function.

For example, we can assign 1s and 0s as correct actions to the conditions in figure 6 as we like, as long as we avoid assigning the same correct action to both 00 and 01. Doing so would make it possible to replace them with 0#, yielding an [O] with only 2 rules, which consequently lies outside the class denoted by figure 6 (because it contains 3 conditions). By assigning 1s and 0s to a set of conditions in different ways, and avoiding the possibility of replacing conditions with more general ones, we can obtain different functions, all of which have |[O]| = 4. The 4 [O]s which can be produced from the condition set in figure 6 are shown in figure 7.

This approach of specifying only conditions is used in figure 12. Note that we are using these more compact representations for the specification of test functions, but XCS is not using them to represent its hypotheses about the function it is learning.

  00
  01
  1#
Figure 6. A set of conditions used to specify a class of functions.

  Input  Action    Input  Action    Input  Action    Input  Action
  00     0         00     1         00     0         00     1
  01     1         01     0         01     1         01     0
  1#     0         1#     0         1#     1         1#     1
Figure 7. The four [O]s represented by the conditions in figure 6.
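The assignment procedure just described can be sketched as a brute-force enumeration in Python. (This is a simplification: it checks only pairwise merges of conditions differing in a single specified bit, which suffices for this example; the names are ours.)

    from itertools import product

    def mergeable(c1, c2):
        # Two conditions can be replaced by one more general condition when
        # they are identical except at one position where one has 0 and the
        # other has 1.
        diffs = [(a, b) for a, b in zip(c1, c2) if a != b]
        return len(diffs) == 1 and set(diffs[0]) == {'0', '1'}

    def optimal_populations(conditions):
        # Yield every assignment of correct actions in which no two
        # mergeable conditions share the same correct action.
        for actions in product((0, 1), repeat=len(conditions)):
            rules = list(zip(conditions, actions))
            if not any(mergeable(c1, c2) and a1 == a2
                       for i, (c1, a1) in enumerate(rules)
                       for c2, a2 in rules[i + 1:]):
                yield rules

    # list(optimal_populations(['00', '01', '1#'])) yields the four [O]s
    # of figure 7.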

2.3.  Measuring Problem Difficulty

So far we've discussed problem difficulty without defining its meaning. Different measures are possible. The primary measure we'll use is [%O], the proportion of the optimal population present in the classifier system on a given time step, which is useful as a measure of the progress of genetic search. An alternative metric, more commonly used with XCS, is that used by Wilson in, e.g., [Wilson95a], simply called 'performance'. This is defined as the proportion of the last 50 inputs to which the system has responded correctly. It measures the extent to which XCS has found a population of classifiers constituting a solution to the problem. [%O], in contrast, measures the extent to which XCS has found the optimal solution. The latter is naturally more difficult to find, and requires more trials (inputs to the system) to learn. Even after XCS has reached a point where it responds correctly to all its inputs, it still needs more time to find the optimal solution.

We prefer the use of [%O] because it is more sensitive to the progress of genetic search. As demonstrated in [Kovacs99a], [%O] can reveal differences which do not show up using the other performance measure. [%O] also seems a natural choice of metric given the discussion in section 3.2.

Another metric we'll use is the mean [%O] regret (or simply regret), defined as the difference between 100% and the observed [%O], averaged over all trials and all runs. This corresponds to the mean distance between the top of a [%O] graph and the [%O] curve. Perfect performance would produce a [%O] regret of 0, while the worst possible performance would produce a regret of 1.
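Both measures are straightforward to compute from logged runs. A sketch in Python, with rule sets represented as sets of condition/action pairs (the representation and names are ours):

    def percent_O(population, optimal):
        # [%O]: the proportion of the optimal rule set currently present.
        return len(optimal & population) / len(optimal)

    def mean_regret(runs):
        # Mean [%O] regret: the mean shortfall from 100% [O], averaged
        # over all trials of all runs. 0 is perfect; 1 is worst possible.
        return (sum(1.0 - p for run in runs for p in run)
                / sum(len(run) for run in runs))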

2.4.  Population Sizing

We could consider the population size required to efficiently learn a function as another measure of its complexity. Different test functions - even those of the same string length - can have rather different population size requirements. Because of this, it is important to take population size into consideration in any comparison of different test functions. Otherwise, differences in performance may be due to the suitability of the population size used, rather than to some other feature of the problem. Different search mechanisms may have different population size requirements too, so population size should also be considered when comparing them.

Population sizing experiments on a range of 6-bit functions showed that in each case performance plateaued beyond a certain population size. We chose a population size limit of 2000 rules for all experiments, as all 6-bit functions should have plateaued by this point.

2.5.  Experimental Procedure

The tests used in the following sections followed the standard experimental setup defined in [Wilson95a] and subsequently used in many other studies of XCS. The essence of a trial is that XCS is presented with a randomly generated binary string of a fixed length as input, it responds with either a 0 or a 1, and receives a numerical reward as specified by the reward function.

Unless otherwise stated, the settings used were the standard ones for the 6 multiplexer from [Wilson95a]. We used the specify operator from [Lanzi97a], GA subsumption deletion [Wilson98a], a niche GA in the action sets [Wilson98a], and the t3 deletion scheme from [Kovacs99a], which protects newly generated rules (with a delay of 25 for all tests). This configuration represents our current best guess at a good implementation of XCS for the problems considered here (apart from the omission of action set subsumption, which would probably improve performance in general). No attempt was made to optimise parameters for individual problems. Finally, we used uniform crossover rather than the 1-point crossover used in previous work with XCS in order to avoid any bias due to the length or position of building blocks [Goldberg89c] for the solution.

3.  Dimensions of Problem Difficulty

Now that we have outlined our approach, we can finally ask the question: what dimensions of single step problem difficulty are there for XCS?

Wilson briefly investigated complexity in [Wilson98a] using a series of boolean multiplexer functions of string length 6, 11 and 20. The size of the optimal populations |[O]| for these functions is 16, 32, and 64 respectively. By evaluating the number of inputs XCS needed to learn each function, Wilson worked out an expression relating |[O]| and difficulty, and another relating string length and difficulty [Wilson98a]. Not surprisingly, functions of greater string length and |[O]| were more difficult. Our experience with multiplexer-based functions of various |[O]| [Kovacs96a] and multiplexer and parity functions [Kovacs97a] indicates that larger |[O]| is more difficult even when string length is constant. But, as we will soon see, string length and |[O]| are not the only dimensions of problem complexity.

3.1.  String Length

The lengths of the input and action strings determine the size of the space of rules XCS searches in, so it seems reasonable that difficulty would generally increase with string length. But if we consider the Kolmogorov complexity of binary strings we can see that it is quite possible for a long string to have low complexity while a much shorter string has high complexity. (See [Li97a] for an introduction to Kolmogorov complexity.) For example, it seems intuitive that a string of 10,000 zeroes has low complexity, while a binary string of 100 randomly distributed ones and zeroes has higher complexity. Technically this assumes the use of an appropriate language to represent the strings, but intuitively the longer string has much greater regularity than the shorter one. If the strings were computer files the longer string could be compressed into a smaller file than the shorter string.

Just as some strings are much more regular than others, some functions are much more regular than others (assuming a given language in both cases). If our language can capture the regularities in a function we can represent it much more compactly. We'll see in section 3.2 that how compactly we can represent a function correlates well with the difficulty XCS has in learning it, which means that string length does not have as great an effect on difficulty as we might think. That is, some functions of a given input length will be much easier than others, and will even be easier than some others of lesser input length.

3.2.  |[O]|

It makes some intuitive sense that |[O]| would be a major factor in problem complexity for XCS, because |[O]| is determined by how much useful generalisation XCS can express, and XCS is largely concerned with finding generalisations. Unlike earlier LCS, XCS was designed to learn a complete mapping from state/action pairs to their values [Wilson95a,Kovacs2000a]. Further, it was designed with accurate generalisation over this mapping in mind, and accurate generalisation has been a focus of XCS research from the beginning (see [Wilson95a,Kovacs96a,Kovacs97a,Kovacs97c,Lanzi97a,Wilson98a,Lanzi98c]).

Since an optimal population for a function is - assuming the ternary language - a minimal representation of the function, |[O]| is a measure of its complexity using this language. Thus we can think of |[O]| as a representation-specific measure of Kolmogorov complexity.

Figure 8. [%O] for 6-bit functions with |[O]| equal to 2, 4, 8, 16, 32, 64 and 128. Curves are averages of 1000 runs.

Figure 8 shows the difficulty of a series of 6-bit functions of increasing |[O]| (the series is defined in section 5). Difficulty increases with |[O]|, but the rate of increase of difficulty slows as |[O]| rises. That is, changes in |[O]| make less difference to difficulty at higher values.

Why exactly does difficulty increase with |[O]|? One factor seems to be that as |[O]| increases, the rules in the population become, on the whole, more specific. More specific rules may be more difficult for XCS to find because XCS updates rules and affords them opportunities to reproduce only when they match an input. This has two consequences. First, more specific rules match fewer inputs and so are updated less frequently, and reproduce less frequently. The reproductive trial allocation scheme used in XCS (see [Wilson95a]) balances out the difference between general and specific rules to some extent, but specific rules still reproduce less than more general ones. Second, since a specific rule is likely to have been generated by the reproduction of another specific rule, specific rules are also generated less frequently, precisely because specific rules reproduce less frequently.

The second reason more specific rules may be more difficult to find is that rules start with low fitness and only gain fitness as they are updated. It takes a number of updates for a rule to reach its full fitness, and more specific rules will take longer because they are updated less frequently. Because it takes such rules longer to reach their full fitness, they reproduce less than do more general rules.

A third reason is that genetic search in XCS seems to move mainly from specific to more general rules. Rules which are too general are unfit, whereas rules which are too specific are fit (although not as fit as rules which are as general as possible without being too general). So rules which are too general do not reproduce, while rules which are too specific do. If we think of a fitness landscape, more general rules have larger basins of attraction, while more specific rules have smaller ones. It should take longer for the GA to create a rule which falls in the basin of a specific rule - the GA seems inefficient at finding these rules. Lanzi noticed this problem and added a "specify" operator to XCS which detects the situation when all rules matching an input are overgeneral and creates a more specific version of a randomly chosen matching rule [Lanzi97a].

A final reason may be that as |[O]| increases, the proportion of inaccurate to accurate rules increases in the space of all possible rules. To see this, consider the extreme case of the constant function, in which all rules are accurate. The 3-bit version of this function is shown as an [O] in figure 9. Its [O] has only 1 rule for each action, so it has the smallest possible |[O]| of any function. Any other function will have some inaccurate rules and a greater |[O]|.

As the proportion of accurate rules decreases, the proportion of reproductive events which generate inaccurate rules should rise. If XCS does indeed search mainly from accurate to more general accurate rules, the generation of new inaccurate rules would contribute little to the search process. This would mean that reproductive events which generate inaccurate rules contribute little to search, so search would be slower on functions with more inaccurate rules, i.e. those of greater |[O]|.

  Input   Action   Reward
  ###   0   1000
  ###   1   0
Figure 9. The 3-bit constant function, with one possible reward function.

3.3.  The Reward Function

As we saw in section 2.2, the reward function can constrain what generalisations are possible. Apart from this, however, we expected the form of the reward function to have little effect on problem difficulty, because in XCS fitness is based on the accuracy with which a rule predicts rewards, and not on the magnitude of the reward prediction. However, results show the range of rewards - that is, the difference between the highest and lowest reward - has a potentially strong effect on problem difficulty.

  Input   Action   Reward
  0000    0        100
  0000    1        0
  0001    0        0
  0001    1        100
  0010    0        0
  0010    1        100
   ...    ...      ...
  0110    0        100
  0110    1        0
   ...    ...      ...
  1111    0        x
  1111    1        y
Figure 10. The 4-bit parity function, with rewards.

To study the effect of reward range we used 4-bit parity functions in which the rewards for most of the function were binary and held constant, but the rewards for one state (1111) were varied (see figure 10). The population size limit was 500 and other settings were as defined in section 2.5 and used for the 6-bit tests shown in figure 8.

As the range in rewards increases, problem difficulty initially decreases, but then increases (see figure 11). To explain the means by which the reward range has its effects we must refer to the XCS update rules [Wilson95a], in particular the prediction error update:

epsilon_j <- epsilon_j + beta ( |R - p_j| / (R_max - R_min) - epsilon_j )

and the rule accuracy update:

kappa_j = 1                                                          if epsilon_j <= epsilon_0
kappa_j = 0.1 exp( (ln alpha)(epsilon_j - epsilon_0) / epsilon_0 )   otherwise

where epsilon_j is the prediction error of rule j, 0 < beta <= 1 is the learning rate, R is the reward, p_j is the prediction of rule j, R_max and R_min are the highest and lowest rewards possible in any state, kappa_j is the accuracy of rule j, epsilon_0 is a constant controlling the tolerance for prediction error, and 0 < alpha < 1 is a constant controlling the rate of decline in accuracy when the threshold epsilon_0 is exceeded (see [Wilson95a]). Wilson has since changed the accuracy update to:

kappa_j = 1                                     if epsilon_j < epsilon_0
kappa_j = alpha (epsilon_j / epsilon_0)^(-v)    otherwise

where 0 < v is another constant controlling the rate of decline in accuracy when epsilon_0 is exceeded.
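For clarity, the three updates can be transcribed directly into Python (a sketch of the formulas above, not of a full XCS implementation):

    import math

    def update_error(eps_j, beta, R, p_j, R_max, R_min):
        # Prediction error update: the error |R - p_j| is normalised by
        # the reward range before the Widrow-Hoff style update.
        return eps_j + beta * (abs(R - p_j) / (R_max - R_min) - eps_j)

    def accuracy(eps_j, eps_0, alpha):
        # Original accuracy update: exponential decline in accuracy once
        # the error exceeds the tolerance eps_0.
        if eps_j <= eps_0:
            return 1.0
        return 0.1 * math.exp(math.log(alpha) * (eps_j - eps_0) / eps_0)

    def accuracy_revised(eps_j, eps_0, alpha, v):
        # Wilson's revised update: power-law decline above eps_0.
        if eps_j < eps_0:
            return 1.0
        return alpha * (eps_j / eps_0) ** (-v)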

Figure 11. Mean [%O] regret with 95% confidence intervals vs. reward range on the 4-bit parity problem (figure 10). Curves are averages of 400 runs.

We can see that in the prediction error update the error between prediction and reward, |R - p_j|, is normalised to fall between 0 and 1 by dividing it by the range in rewards, R_max - R_min. This means the error of a rule is inversely proportional to the reward range; the larger the range the smaller the error.

The accuracy update says something like: the larger the error the lower the accuracy. Since fitness is based on accuracy, larger errors mean lower fitness. Putting all this together, larger reward ranges mean the fitness calculation is less sensitive to prediction error - it takes a larger error to produce the same effect. This means that, with larger reward ranges, XCS will often have more difficulty distinguishing accurate from inaccurate rules. We attribute the increase in difficulty as the reward range grows to this effect.

However, with larger reward ranges XCS will not always have more difficulty distinguishing accurate from inaccurate rules. In the extreme case a rule may be updated towards both Rmax and Rmin, and will have a large error. The errors of other rules will, in comparison, be small. This may allow XCS to more easily distinguish the overgenerality of some rules, and may account for the initial decrease of problem difficulty as reward range increases. We hypothesise that as reward range increases further this effect is then swamped by the more significant effect (that XCS becomes less sensitive to prediction error) and problem difficulty increases.

In order to avoid confounding effects of rewards, in this investigation we use only binary reward functions with the same reward range unless otherwise noted.

3.4.  Mean Hamming Distance

The Hamming distance between two strings is defined as the number of characters which must be changed to transform one into the other. The mean Hamming distance (MHD) of an [O] is the mean distance between each pair of condition/action strings in the population, including the comparison of a given string with itself.
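A sketch of this computation in Python (the names are ours), including each string's zero distance to itself as the definition requires:

    def hamming(a, b):
        # Number of positions at which the two strings differ.
        return sum(x != y for x, y in zip(a, b))

    def mean_hamming_distance(rules):
        # Mean pairwise Hamming distance over the concatenated
        # condition/action strings of an [O], self-comparisons included.
        strings = [cond + act for cond, act in rules]
        return (sum(hamming(a, b) for a in strings for b in strings)
                / len(strings) ** 2)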

To study the effect of MHD on difficulty we used the 4 functions represented in figure 12. This figure shows 4 sets of conditions, each of which can be transformed into a fully specified [O] by assigning alternating 0s and 1s as the correct action for each condition, commencing with 0 as the correct action for the topmost condition and working downwards. Correct actions received a reward of 100 while incorrect actions received a reward of 0.

The 4 [O]s represented in figure 12 each have |[O]| = 8 and string length 6, but each has a different mean hamming distance. Figure 13 shows that difficulty increases with MHD on this set of functions.

  H1         H2         H3         H4
  000 ###    00# 0##    00# 0##    000 ###
  001 ###    00# 1##    00# 1##    001 ###
  010 ###    01# 0##    01# 0##    01# 0##
  011 ###    01# 1##    01# 1##    01# 1##
  100 ###    10# #0#    10# #0#    10# #0#
  101 ###    10# #1#    10# #1#    10# #1#
  110 ###    11# #0#    11# ##0    11# ##0
  111 ###    11# #1#    11# ##1    11# ##1
  MHD: 2     2.75       2.9375     3.125
Figure 12. From left to right: [O]s with increasing mean Hamming distance, but constant |[O]| and string length.

We hypothesise that this effect is partly due to the greater ease of transforming more similar strings into each other by crossover and mutation. In [O]s with shorter mean hamming distances it is easier to move from one accurate general rule to another.

An additional factor may be involved. In section 3 we hypothesised that inaccurate rules contributed little to genetic search. If this is the case, mutation of an accurate rule into an inaccurate rule is a waste of effort, and slows the rate of genetic search. Rules which are more similar to other accurate rules are more likely to mutate into them, and less likely to mutate into inaccurate rules. Even if the accurate rule already exists, there may be more benefit in creating another copy of it than in creating an inaccurate rule. Optimal populations with smaller hamming distances between their rules are less likely to waste their efforts by producing inaccurate rules.

Figure 13. [%O] for [O]s with different mean Hamming distances. Difficulty increases with MHD. Curves are averages of 100 runs.

4.  The Space of Single Step Functions

In the preceding sections we've identified several dimensions of single step problem complexity for XCS. In this section we consider some characteristics of the space of single step functions and how these dimensions structure it.

The space of single step functions grows rapidly with string length - there are 2^(2^l) possible binary functions for a binary string of length l. We know that some of these functions are more difficult for XCS than others. One dimension which affects difficulty is |[O]| - results from section 3.2 show that difficulty increases with |[O]|. We can use |[O]| to structure the space of l-bit functions: at one extreme, with |[O]| maximised for a given l, we have a parity function. In this case [O] consists of 2^l × 2 fully specific rules. (There are 2^l inputs to match, and each maps to 2 actions.) At the other extreme, with |[O]| minimised, we have a constant function. In this case [O] consists of 2 rules with fully generalised conditions (since the fully generalised condition maps to 2 actions).

Although the parity and constant functions bound |[O]| for a given string length, they are atypical: of the 2^(2^l) functions for a given l there are only 2 parity and 2 constant functions. The vast majority of functions have |[O]| somewhere between the two.

If |[O]| were the only dimension relevant to difficulty we would be justified in stating that the difficulty of one function d([O1]) is greater than that of another d([O2]) if its |[O]| is greater. That is, if |[O1]| > |[O2]| then d([O1]) > d([O2]). This would mean that, for l-bit functions, parity was the hardest and the constant function the easiest. This would give us bounding cases on complexity for l-bit functions and a unique ordering among them (by |[O]|). Further, it would give us only 2^l × 2 complexity classes (sets of functions of equivalent difficulty) in a much larger space of 2^(2^l) functions. That is, if we wanted to test XCS on all different levels of difficulty for l-bit functions, we would only have to test 2^l × 2 rather than 2^(2^l) functions. (For l = 6, that is 128 tests rather than 2^64, or roughly 1.8 × 10^19.)

However, we know that |[O]| is not the only dimension of problem difficulty. Let's consider the others we've identified. Mean Hamming distance, for a given string length l, covaries with |[O]|: the constant and parity functions - the bounding cases for |[O]| of a given l - have fixed mean Hamming distances. MHD is only variable away from these extremes, so we need only consider its effect away from them. For example, we need not worry about whether MHD can make one parity function more difficult than another, since they must have the same MHD. This suggests that, unless MHD has a very strong effect - and our studies suggest it does not - complexity for l-bit functions is indeed bounded by the constant and parity functions. This issue deserves further study.

Unlike MHD, the reward range is independent of |[O]|: we can use any rewards we like with any function. This suggests that in comparing the complexity of functions, we should hold reward range constant, unless reward range is itself the object of comparison, in which case we should hold all other dimensions constant. This was the approach taken in section 3.3.

The above suggests that, for a given string length and reward range, |[O]| may be a reasonable metric for problem difficulty, and that the idea of dividing the space of functions into 2^l complexity classes defined by |[O]| is also reasonable.

It is unfortunate that we have been unable to devise a more theoretically satisfying model of complexity than the "|[O]| + noise" model proposed above. However, it is perhaps not surprising given the complexity of XCS: the classifier update rules, genetic algorithm, generalisation mechanism, deletion scheme, triggered operators and other mechanisms all affect the system's performance. While no simple precise model of all the above has been found, we are pleased that a single dimension, |[O]|, provides a simple and seemingly reasonable metric for problem difficulty. A somewhat more precise metric could perhaps be devised by combining |[O]| and MHD, but we will not consider this here.

What other dimensions of single step problem difficulty exist for XCS, and what their significance is, remains to be seen. Because of this, it also remains to be seen whether |[O]| is sufficient as a complexity metric.

5.  A Ternary Single Step Test Suite

In section 3.2 we noted that generalisation is an important subject for XCS, and that |[O]| is a measure of the degree to which generalisation is possible using the ternary language. We also saw that |[O]| has a major effect on problem difficulty. In section 4 we saw how |[O]| can be used to structure the space of functions of a given string length, and how using |[O]| as a complexity metric divides the space of functions into a set of complexity classes. There are many fewer complexity classes than functions, which means we have to test XCS on only a small fraction of the function space to evaluate it at all levels of difficulty. However, there seems little need to go into such detail, particularly since higher values of |[O]| make increasingly fine distinctions about difficulty.

Based on these observations, we propose a single step test suite which ranges over the dimension of |[O]|, and - to the extent that |[O]| captures problem difficulty - ranges over problem difficulty. The suite is generated for l-bit strings as follows:

1. Begin with the parity function of string length l, in which all l bits are relevant.
2. Make one of the relevant bits irrelevant, i.e., replace it with a # in each condition. This halves |[O]| and yields a parity function over the remaining relevant bits.
3. Repeat step 2 until no relevant bits remain and we are left with the constant function.

This algorithm yields a set of l+1 functions for l-bit strings. Recall that the number of functions grows hyperexponentially with the length of the input string, and that the number of complexity classes defined by |[O]| grows exponentially with it. Using this test suite, however, the number of tests we have to make grows only linearly with the input string. In other words, it scales well to longer string lengths in terms of the effort required to perform the tests.
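A sketch of the generator in Python, following the leftmost-first bit choice used for the figure 8 experiments (see the end of this section). Each function is listed by its conditions and correct actions, so the full [O], which also contains the complementary incorrect-action rules, has twice as many rules:

    from itertools import product

    def test_suite(l):
        # Build l+1 functions: the l-bit parity function, then parity over
        # ever fewer relevant bits (leftmost bits become #), ending with
        # the constant function. |[O]| halves on each step: 2^(l+1) ... 2.
        suite = []
        for irrelevant in range(l + 1):
            relevant = l - irrelevant
            rules = [('#' * irrelevant + ''.join(bits),
                      str(bits.count('1') % 2))    # parity of relevant bits
                     for bits in product('01', repeat=relevant)]
            suite.append(rules)
        return suite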

Note that this test suite is specific to the ternary LCS language, and not to XCS. That is, it may be used with any LCS, or indeed any system employing the ternary LCS language.

A disadvantage of the test suite is that it considers |[O]| as the only dimension of problem difficulty. We would argue that the reward range can be considered separately to |[O]| - we can use any reward range with the test suite. We would also argue that the bounds on |[O]| provide bounds on MHD, since there is no variation in MHD at the bounds of |[O]|. There is the possibility that other, as yet unknown, dimensions of single step problem difficulty exist for XCS. Note, however, that the algorithm for generating the test suite does not specify how to select bits to ignore. By selecting bits in different orders we end up with different versions of the test suite. To cater for the possibility of unknown dimensions of problem complexity we could iterate the suite generation algorithm many times to produce many suites and average the results obtained from using them to test XCS.

The 6-bit tests shown in figure 8 were generated using this algorithm, with the leftmost relevant bit becoming irrelevant on each iteration of step 2.

6.  Summary

We began with some methodological considerations, arguing that our approach of studying single step tasks is reasonable even if we're really interested in sequential ones. We then distinguished between the input/output functions we often speak of and the RL problems XCS is really applied to. Next we presented a way of representing RL problems which is particularly well suited to systems which use the ternary LCS language. Then we saw, for the first time in the literature, how population size affects performance in XCS, and took steps to account for its effect.

We've also taken some steps towards answering the questions posed at the start of the paper. We've examined a number of dimensions of problem complexity, some of them (reward range and MHD) previously unknown. We've illustrated how a significant dimension, |[O]|, structures the space of functions and defines complexity classes within it. Based on this we've presented a single step test suite template that's simple to describe and implement, and which scales to any length input string. We hope this test suite will prove useful, both by improving the way we evaluate LCS, that is, through its use, and by spurring the search for a better suite, and the knowledge needed to construct one.

The work begun here can be extended in many ways. To begin with, the search for additional dimensions of complexity for XCS seems important, as does evaluation of the many hypotheses introduced to account for the effects observed in section 3. To what extent our approach is appropriate for other LCS remains to be seen, as does their sensitivity to the complexity dimensions we've examined with XCS.

Finally, we've provided a great deal of additional empirical evidence to support the suggestion in [Kovacs97a] that XCS reliably evolves [O]s for boolean functions. We suspect that XCS can reliably learn any function from the class studied here, given enough resources.

Acknowledgements

We'd like to thank the two anonymous reviewers for their helpful comments and Stewart Wilson for his support and inspiration over several years.

References


© 2001 Springer, LNAI 1996. Tim Kovacs and Manfred Kerber