Genetic Programming and Data Structures

W. B. Langdon, University College, London, 1996, 350 pages

Chapter 8

Conclusions

The key to successful human-produced software is the use of abstraction to control the complexity of the task in hand; that is, being able within each task to use objects without considering in detail how they interact. The objects are abstractions of lower-level code or data, which are in turn abstractions of still lower levels. Thus a programmer may use a file while knowing nothing about how to program the disk drive on which it is stored. Indeed the program can continue to work even if the file is moved to another disk drive, even one of a different type, even a type which had not been designed when the program was written.

Genetic programming, with its undirected random program creation, might appear to be the antithesis of highly organised software engineering. It is an evolutionary technique, using only the information in its current population; it does not plan ahead or follow a global top-down design. Nevertheless we have already seen, via ADFs and other techniques, that it too can gain considerable benefit from functional abstraction.

While GP work to date has concentrated on functional abstraction, we argue that techniques which allow GP to take advantage of data abstraction will be essential if it is to scale up and tackle large real-world problems. Chapters 3, 4 and 5 show that GP can produce structured data types (stacks, queues and lists). In Chapter 6 we demonstrate that GP can use data abstraction, by solving three problems. For the two more complex examples we have shown that a stack abstract data type is beneficial; while the first example does not require a stack, a general solution was evolved which used the appropriate data structure. The failure of indexed memory to solve the two more complex problems is disappointing, but was expected. It is anticipated that solutions to both problems could be evolved using indexed memory, for example if increased resources were available; however, the richness of interactions supported by indexed memory allows complex interactions to arise, and these complicate the search space, making it harder to search. This problem is a general one: search effort increases rapidly with problem complexity. While other research has shown that changes to the representation (e.g. ADFs and more powerful primitives) can help, this work has shown that reducing the complexity of evolved programs, by using data abstraction to control interactions within them, can also be beneficial.
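The contrast between the two memory models can be sketched as follows. This is an illustrative sketch, not the thesis code: the stack operation names (makenull, push, pop, top, empty) follow the abstract data type of Chapter 3, while the indexed-memory cell count of 63 is an assumption for illustration. The point is that the stack confines every interaction with storage to the top element, whereas indexed memory lets any primitive touch any cell, supporting far richer (and harder to search) couplings between parts of an evolved program.

```python
class Stack:
    """Stack ADT: all interactions with storage go through the top element."""
    def __init__(self):
        self.makenull()

    def makenull(self):
        self._cells = []

    def push(self, x):
        self._cells.append(x)

    def pop(self):
        return self._cells.pop()

    def top(self):
        return self._cells[-1]

    def empty(self):
        return len(self._cells) == 0


class IndexedMemory:
    """Flat indexed memory: any cell may be read or written by any primitive,
    so evolved code fragments can interact through storage in arbitrary ways."""
    def __init__(self, size=63):  # size is an assumption for illustration
        self._cells = [0] * size

    def read(self, i):
        # Wrap out-of-range addresses so any integer is a legal index.
        return self._cells[i % len(self._cells)]

    def write(self, i, v):
        self._cells[i % len(self._cells)] = v
        return v
```

Restricting a primitive set to the five stack operations shrinks the space of possible program-memory interactions, which is exactly the search-space simplification argued for above.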

Appendices C and D have demonstrated that both the combination of a GA with a hand-coded heuristic, and a GP seeded with the same heuristics in its initial population, can produce low-cost maintenance schedules for a real-world electrical power transmission network.

In many of the problems in this thesis, general scalable solutions have been evolved. This is very encouraging. Perhaps general algorithms are easier for GP to find? It may be argued, on the basis of the Minimum Description Length (MDL) principle or Occam's Razor, that general programs tend to be shorter than programs which are specific to the test cases and fail to generalise. Non-general programs may ``memorise'' the tests, and must be longer and more complex to do so. Perhaps solutions occur more frequently in the search space of shorter programs, or perhaps GP is less effective at searching for longer programs?

The idea that symbolic regression compresses the training data into a program can be inverted. If a required program's size can be estimated, then so too can its information content. This gives a lower bound on the information content of the training data and thus a lower bound on the size of the training set. It indicates that the volume of training data will need to increase as we try to evolve more ambitious programs. If we continue to test all evolved programs on the whole training set, then GP machine resource usage will grow at least quadratically with task complexity. However, techniques such as co-evolution, soft brood selection and sparse training sets indicate that it may not be necessary to test every evolved program exhaustively.
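The sparse-training-set idea can be sketched as follows (an illustrative sketch only; the function name and sampling scheme are assumptions): each evolved program is scored on a small random subsample of the training cases rather than all of them, so per-evaluation cost is fixed regardless of how large the full training set grows. The fitness estimate is noisy, but selection acting over many individuals and generations averages out the noise.

```python
import random

def sparse_fitness(program, training_set, sample_size, rng=random):
    """Estimate fitness on a random subsample of (input, expected) pairs.

    Returns the fraction of sampled cases the program gets right,
    avoiding the cost of testing every case on every evaluation.
    """
    sample = rng.sample(training_set, min(sample_size, len(training_set)))
    correct = sum(program(x) == y for x, y in sample)
    return correct / len(sample)
```

A program that is genuinely general scores well on any subsample, while a program that has only memorised a few cases is soon exposed as the subsamples change between evaluations.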

8.1 Recommendations

A number of practical recommendations for GP work can be made. To a large extent the advice in Advances in Genetic Programming [Kinnear, 1994] and Genetic Programming [Koza, 1992] remains sound; however, a number of additional suggestions can be made:

8.2 Future work

There are many interesting questions raised by the work in this thesis. A number of techniques that have been introduced here, or which are fairly new to GP, warrant further investigation to explore their benefits and to clarify the circumstances in which they are best used. Examples include:

However, the failure of GP to evolve data structures ``on the fly'' is the most important. Aspects that could be investigated include: is the failure specific to the problems tried, to the primitive sets used, or to insufficient resources being dedicated to the task? If the latter, how many extra resources are required? While these are possible explanations, we feel that this failure is part of the general difficulty of scaling GP up to solve more complex problems, and so its solution would have a direct bearing on GP's fundamental scaling problem.

The addition of data structures greatly extends the power of genetic programming, and GP plus data structures should be evaluated on problems which require them. The use of stack data structures with other context-free languages is an obvious first step.
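A concrete instance of the kind of target behaviour meant here (an illustrative hand-written example, not an evolved program) is recognising the language of balanced brackets, a classic context-free language: a stack suffices, pushing on each opening bracket and popping on each matching close, and accepting exactly when the stack empties at the end of the string.

```python
def balanced(s):
    """Recognise the context-free language of balanced brackets."""
    pairs = {')': '(', ']': '[', '}': '{'}
    stack = []
    for ch in s:
        if ch in '([{':
            stack.append(ch)          # opening bracket: push
        elif ch in pairs:
            # closing bracket: must match the most recent open one
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack                  # accept iff nothing left unmatched
```

An evolved program with access to a stack abstract data type would only need to discover this push/pop discipline, rather than also having to invent last-in-first-out storage from flat memory.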

W.B.Langdon@cs.bham.ac.uk 16 February 1997