Despite two decades of work, learning classifier systems researchers have had relatively little to say on the subject of what makes a problem difficult for a classifier system. One focus of our work has been the issue of what makes a problem difficult for XCS - Wilson's recent accuracy-based classifier system. This document outlines the approach taken, provides some initial results and outlines possible directions for future work.
Two basic questions to ask about any learning system are: to what kinds of problems is it well suited? To what kinds of problems is it poorly suited? Despite two decades of work, Learning Classifier Systems (LCS) researchers have had relatively little to say on the subject. Although this may in part be due to the wide range of systems and problems the LCS paradigm encompasses, it is certainly a reflection on the primitive state of LCS theory.
Delving into the subject more deeply, a host of other questions arise. What is it about a problem that makes it difficult or easy for a given system? In other words, what factors are involved in determining a problem's difficulty? How do these factors interact? What effects do the representation(s) used by the system have? What effects do the search engine(s) have? In short, we want to know about the dimensions of problem complexity for the system of interest. Clearly this is a difficult subject even for one particular learning system. In this case the system of interest is Wilson's XCS [Wilson95a]. To simplify matters, we consider only the use of the standard ternary LCS alphabet, a binary action space and single step reinforcement learning problems.
One reason to ask questions like those above is, of course, to find out when we can hope to successfully apply a given system to a given problem. Another reason is to find out how to evaluate a system. We can't test our systems on all possible problems, so we must consider only a subset. How can we choose this subset? It should capture the things which make a problem difficult for the system in question.
Our approach is to consider the space of all possible test functions (i.e. learning problems) given the various restrictions above. Before going any further, however, it seems important to cover some methodological considerations.
Single step functions are those in which the LCS's actions have no influence on which inputs it receives in the future. This contrasts with sequential functions, in which actions do influence future inputs. We mainly study single step functions for two reasons. First, some applications, e.g. data mining, are single step, so it is of interest to understand and to optimise LCS for single step problems. Second, single step functions avoid many complications of sequential ones. Even if we're interested only in sequential problems, it seems reasonable to first evaluate, understand, and possibly improve our systems in the simpler single step case.
In reinforcement learning (RL), feedback to the learner about the value of its actions consists only of numerical values, called rewards, and the goal of any reinforcement learner is to maximise the rewards it receives. Rewards are defined a priori by the experimenter in the form of a reward function, which maps states and actions to the real number line. The reward function is an essential component of any RL problem specification; changing the reward function changes the problem.
|input||action|| reward |
|00#||0|| 1000 |
|00#||1|| 0 |
|01#||0|| 0 |
|01#||1|| 1000 |
|1#0||0|| 1000 |
|1#0||1|| 0 |
|1#1||0|| 0 |
|1#1||1|| 1000 |
Working with LCS, we often refer to test functions in terms of an input/action space. For example, the 6 multiplexer has often been used as an LCS test. Figure 1 shows the 3 multiplexer, a related but simpler function. For now, consider only the input/action space defined by the left and centre columns. Rather than list each input individually we've used ternary classifier conditions to express generalisations over inputs. The 3 multiplexer is defined on binary strings of length 3, and treats the string as being composed of an index segment (the first bit) and a data segment (the remaining two bits). The value of the function is the value of the indexed bit in the data segment, so, for example, the value of 000 is 0, the value of 001 is 0, the value of 101 is 1 and so on. Knowing the value of the string we can do supervised learning; that is, we know which action the LCS must respond with in order to be correct.
However, this input/action mapping alone is not a complete specification of an RL problem, since it does not specify rewards. In adapting a boolean function to an RL paradigm we need to extend it by defining a reward function. So the 3 multiplexer is not a complete RL problem - we need to extend it with a reward function. This is where the right column of figure 1 comes in. 3 multiplexer problems differ when their reward functions differ.
XCS seeks to find accurate generalisations over the reward function. One effect of adding rewards can be to constrain what generalisations are possible. As figure 1 shows, the input/action structure of the 3 multiplexer admits a number of generalisations. However, if we choose a reward function with a sufficiently different reward for each state/action pair XCS will be unable to generalise at all. In this case we would need to use fully specific conditions in figure 1. Effectively this change in the rewards transforms the 3 multiplexer into a parity problem since it means no accurate generalisations are possible. This demonstrates that referring to input/output functions (e.g. multiplexer and parity functions) can be misleading when we are really referring to RL problems.
Notice how the 3 multiplexer was represented using classifier conditions to express generalisations. In effect the function is represented by a set of LCS rules. We could have used other sets of rules; for example we could have used a set of fully specific rules. The set we chose has certain properties: it is complete (maps the entire reward function), accurate (each rule maps only to a single reward), non-overlapping (no input/action/reward triplet is described by more than one rule), and minimal (no rules can be replaced by more general accurate, non-overlapping rules). We call a set of rules with these characteristics an optimal population [O]. We find it convenient to represent test functions by their optimal populations because it is easy to manipulate them to produce other optimal populations with their own corresponding functions. Working directly with the target representation makes it obvious what effects a transformation has.
So far we've discussed problem difficulty without defining it. Different measures are possible. The primary measure we've used is %[O], the proportion of the optimal population present in the classifier system on a given time step. This is useful as a measure of the progress of genetic search. An alternative metric, more commonly used with XCS, is that used by Wilson in, e.g., [Wilson95a], simply called `performance'. This is defined as the proportion of the last 50 inputs to which the system has responded correctly. It measures the extent to which XCS has found a population of classifiers constituting a solution to the problem. %[O], in contrast, measures the extent to which XCS has found the optimal solution. We prefer the use of %[O] because it is more sensitive to the progress of genetic search - %[O] can reveal differences which do not show up using the other performance measure. %[O] also seems a natural choice of metric given the importance of the length of [O] to problem difficulty (see section 3.2).
A third possible measure of problem difficulty is the population size required to efficiently learn a function. Different test functions - even those of the same string length - can have different size requirements.
What dimensions of single step problem difficulty are there for XCS? Wilson briefly investigated complexity in [Wilson98a] using a series of boolean multiplexer functions of string length 6, 11 and 20. The size of the optimal populations (|[O]| for short) for these functions is 16, 32, and 64 respectively. By evaluating the number of inputs XCS needed to learn each function Wilson worked out an expression relating |[O]| and difficulty, and another relating string length and difficulty [Wilson98a]. Not surprisingly, functions of greater string length and |[O]| were more difficult. Our experience with multiplexer and parity functions indicates that larger |[O]| is more difficult even when string length is constant. But, as we will shortly see, these are not the only dimensions of complexity.
The lengths of the input and action strings determine the size of the space of rules XCS searches in, so it seems reasonable that difficulty would generally increase with string length. But consider the Kolmogorov complexity of binary strings. It is quite possible for a long string to have low complexity while a much shorter string has high complexity. For example, in an appropriate language a string of 10,000 zeroes has low complexity, while a binary string of 100 randomly distributed ones and zeroes has higher complexity. The key point is that the longer string has much greater regularity than the shorter one. If the strings were computer files the longer string could be compressed into a smaller file than the shorter string.
It makes some intuitive sense that |[O]| would be a major factor in problem complexity for XCS because |[O]| is determined by how much useful generalisation XCS can express, and XCS is largely concerned with finding generalisations. Unlike earlier LCS, XCS was designed to learn a complete mapping from state/action pairs to their values [Wilson95a,Kovacs99d]. Further, it was designed with accurate generalisation over this mapping in mind, and accurate generalisation has been a focus of XCS research from the beginning.
Since an optimal population for a function is - assuming the ternary LCS alphabet - a minimal representation of the function, |[O]| is a measure of its complexity using this language. Thus we can think of |[O]| as a representation-specific measure of Kolmogorov complexity. Figure 2 shows the difficulty of a series of 6 bit functions of increasing |[O]| (see [Kovacs2000z] for details).
Why exactly does difficulty increase with |[O]|? One factor seems to be that as |[O]| increases, the rules in the population become, on the whole, more specific. More specific rules may be more difficult for XCS to find because XCS updates rules and affords them opportunities to reproduce only when they match an input. This has two consequences. First, more specific rules match fewer inputs and so are updated less frequently, and reproduce less frequently. The reproductive trial allocation scheme used in XCS (see [Wilson95a]) balances out the difference between general and specific rules to some extent, but specific rules still reproduce less than more general ones. Second, consider that a specific rule is likely to have been generated by the reproduction of another specific rule. This means that specific rules are generated less frequently because they reproduce less frequently.
As we saw in section 2.2, the reward function can constrain what generalisations are possible. Apart from this, however, we expected the form of the reward function to have little effect on problem difficulty, because in XCS fitness is based on the accuracy with which a rule predicts rewards, and not on the magnitude of the reward prediction. However, results show the range of rewards - that is, the difference between the highest and lowest reward - has a potentially strong effect on problem difficulty. This is due to the use of reward range in the normalisation of a rule's prediction error in the XCS fitness updates [Wilson95a].
The hamming distance between two strings is defined as the number of bits which must be changed to transform one into the other. The mean hamming distance (MHD) of an [O] is the mean distance between all possible pairs of rules (where rules are condition/action pairs). Figure 3 shows four [O]s, all with |[O]| = 8 and string length 6, but each with a different MHD. Figure 4 shows that difficulty increases with hamming distance. This effect is attributable to the greater ease of transforming more similar strings into each other by crossover and mutation. In [O]s with shorter MHD it is easier to move from one accurate general rule to another.
|H1||H2||H3|| H4 |
|000 ###||00# 0##||00# 0##|| 000 ### |
|001 ###||00# 1##||00# 1##|| 001 ### |
|010 ###||01# 0##||01# 0##|| 01# 0## |
|011 ###||01# 1##||01# 1##|| 01# 1## |
|100 ###||10# #0#||10# #0#|| 10# #0# |
|101 ###||10# #1#||10# #1#|| 10# #1# |
|110 ###||11# #0#||11# ##0|| 11# ##0 |
|111 ###||11# #1#||11# ##1|| 11# ##1 |
|MHD||2||2.75||2.9375|| 3.125 |
This work could be extended in many directions, but perhaps the most important is to study to what extent the dimension of difficulty already identified capture difficulty for XCS. It seems likely that other as yet unknown dimensions exist, and it may be that they have a significant effect on difficulty. To investigate this it should be possible to exhaustively test all 3 bit functions of the class studied here. The observed difficulty of each function could then be compared with that expected based on its |[O]| and MHD. The correlation between the two would give us an indication of the contribution of these two dimensions to overall problem difficulty. A problem with this proposal is that some dimensions of difficulty may have little effect with strings as short as 3 bits, and yet much more effect when string lengths are longer. It is not currently possible to exhaustively study longer functions because their number increases hyperexponentially with string length and the necessary computer resources are unavailable. This means we must rely on samples of the spaces of functions defined on longer strings.
Another line of research to follow is the characterisation of the space of boolean functions. We know that the parity and constant functions provide bounding cases in terms of |[O]|, but how typical are they? What is a typical function for a given dimension? Knowing this would help us parameterise XCS when we know nothing about the function it is being tested on.
We've outlined an approach to the study of problem complexity for XCS, an approach limited to an important subset of the problems we might want to apply it to. We've outlined a number of dimensions of problem complexity, some of them (reward range and MHD) previously unknown. We have elsewhere [Kovacs2000z] illustrated how a significant dimension, |[O]|, structures the space of functions and defines complexity classes within it. Based on this we've created a single step test suite template that's simple to describe and implement, and which scales to any length input string. We suspect that XCS can reliably learn any function from the class studied here, given enough resources. So far a great deal of empirical evidence has been gathered to support this suggestion, although no proof is available.