ABSTRACT
In this paper we propose how proof planning systems can be extended with an automated learning capability. The idea is that a proof planner would be capable of learning new proof methods from well-chosen examples of proofs of related theorems that share a similar reasoning strategy; this strategy can then be characterised as a proof method. We propose a representation framework for methods, and a machine learning technique which can learn methods using this representation framework. The technique can be applied (amongst others) to learn whether and when to call external systems such as theorem provers or computer algebra systems. This is work in progress, and we hope to gain useful feedback from the CALCULEMUS community.
Proof planning [pub349] is an approach to theorem proving which uses proof methods rather than low-level logical inference rules to prove the theorem at hand. A proof method specifies and encodes a general reasoning strategy that can be used in a proof, and hence represents a number of individual inference rules. For example, an induction strategy can be encoded as a proof method. Proof planners search for a proof plan of a theorem, which consists of applications of several methods. An object-level logical proof can be generated from a proof plan. Proof planning is a powerful technique because it often dramatically reduces the search space, allows reuse of proof methods, and moreover generates proofs in which the reasoning strategies are transparent, so they may have an intuitive appeal to a human mathematician.
One of the ways to extend the power of a proof planning system is to enlarge the set of available proof methods. Often a number of theorems can be proved in a similar way, hence a new proof method can encapsulate the general structure, i.e., the reasoning strategy, of a proof for such theorems. A difficulty in applying a strategy to many domains is that in current proof planning systems new methods have to be implemented and added by the developer of the system. Our aim is to explore how a system can learn new methods automatically, given a number of well-chosen examples of related proofs. This would be a significant improvement, since examples typically exist in abundance, while the extraction of methods from these examples can be considered a major bottleneck of the proof planning methodology. In this paper we therefore propose an approach to the automatic learning of proof methods within a proof planning framework.
There is a twofold, albeit loose, relation to the Calculemus idea. The first relation is with respect to Kowalski's equation algorithm = logic + control [Kowalski:alc79]: our work aims at exploring general reasoning patterns and the control knowledge hidden in sets of well-chosen proof examples. The extracted knowledge, which is in Kowalski's sense a form of algorithmic knowledge, is represented in such a way that it can be reused for tackling new problems. The exploration of algorithmic knowledge is especially desirable in cases where this knowledge is not a priori available to a reasoning system (e.g., in the form of a built-in or connected computer algebra system). The second relation is that our approach, although described and applied here solely in the context of simplification proof examples in group theory, is not restricted to this domain. It can analogously be applied to learn control knowledge from proof examples that already contain calls to computer algebra systems. Note that computer algebra systems may be employed in a subordinate way in proof planning by calling them within proof methods to perform particular computations [KeKoSo:icaipp98]. Given a set of examples, each of which probably contains several such calls to computer algebra systems, our approach would enable a system to learn the overall pattern of these calls (if there is one). In this sense the learnt methods also encode knowledge of the controlled usage of computer algebra systems. More generally, this argument also applies to other external reasoning systems which are subordinately employed within proof methods, and is not restricted to computer algebra systems.
Figure 1 gives a structure of our approach to learning proof methods, and hence an outline of the rest of this paper.
A proof method in proof planning basically consists of a triple: preconditions, postconditions, and a tactic. A tactic is a program which, given that the preconditions are satisfied, executes a number of inference steps in order to transform an expression representing a subgoal so that the postconditions are satisfied by the transformed subgoal. If no appropriate method is available in a given planning state, then the user (in the case of interactive systems) or the planner (in the case of automated systems) has to explicitly apply a number of lower-level methods (with inference rules as the lowest-level methods) in order to prove a given theorem. It often happens that such a pattern of lower-level methods is applied time and time again in proofs of different problems. In this case it is sensible and useful to encapsulate this inference pattern in a new proof method. Such a higher-level proof method based on lower-level methods can be implemented and added to the system by the user or the developer of the system. Alternatively, we propose that such methods could be learnt by the system automatically.
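As a concrete (if much simplified) picture of this triple, a proof method can be modelled as a record holding an applicability precondition, a postcondition to be established, and the tactic that performs the transformation. The encoding below is our own illustrative sketch, not the representation used by any particular proof planner:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Method:
    """A proof method as a (precondition, postcondition, tactic) triple."""
    name: str
    pre: Callable[[Any], bool]     # is the method applicable to this subgoal?
    post: Callable[[Any], bool]    # does the transformed subgoal satisfy it?
    tactic: Callable[[Any], Any]   # executes the inference steps

def apply_method(m: Method, goal):
    """Apply m only when its precondition holds, checking the postcondition."""
    if not m.pre(goal):
        return None
    result = m.tactic(goal)
    assert m.post(result), "tactic did not establish the postcondition"
    return result
```

A planner would chain such applications, searching among the methods whose preconditions match the current planning state.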
The idea is that the system starts with learning simple proof methods. As the database of available proof methods grows, the system can learn more complex proof methods. Initially, the user constructs simple proofs which consist only of basic inference rules rather than proof methods. A learning mechanism built into the proof planner then spots the proof pattern occurring in a number of proofs and extracts it into a new proof method. Hence, there is a hierarchy of proof steps from inference rules to complex methods. Inference rules can be treated as methods by assigning pre- and postconditions to them. Thus, from the learning perspective we can have a unified view of inference rules and methods as given sequences of elements from which the system is learning a pattern. Note that as a consequence of the hierarchic character of the method language (with methods corresponding to calculus-level rules at the lowest level), the approach is in principle general enough to learn methods at every level of abstraction. While some heuristic information for the compound method can be computed from the component methods, learning more precise heuristic information for the compound method will be necessary. We do not address this problem in this paper.
To demonstrate our ideas with an example we first need to determine our problem domain: we choose theorems of abstract algebra, and in particular theorems of group theory. An example of a proof method is a simplification method which simplifies an expression using a number of simplification inference rules (which are, in our unified view, just basic-level proof methods). One may assume that the simplification inference rules have already been learnt as basic proof methods from rewriting-style proofs employing a single rewriting rule and appropriately instantiated group axioms. In the case of group theorems, the simplification method may consist of applying both (left and right) axioms of identity, both axioms of inverse, and the axiom of associativity. Note that e is the identity element, i is the inverse function, and LHS -> RHS stands for rewriting LHS to RHS:
EXAMPLE
EXAMPLE
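To make the rewriting style of such examples concrete, the following sketch implements the identity and inverse rules over a tuple-based term encoding and records which rule fires at each step. The encoding and the rule selection strategy are our own illustrative assumptions (associativity is omitted so that the loop is guaranteed to terminate):

```python
# Group terms: 'e' (identity), ('i', t) (inverse of t), ('op', s, t) (s * t);
# any other string is a variable. This encoding is our own illustration.

def root_rewrite(t):
    """Return (rule_name, rewritten_term) if an identity or inverse
    rule applies at the root of t, else None."""
    if isinstance(t, tuple) and t[0] == 'op':
        a, b = t[1], t[2]
        if a == 'e':
            return 'idl', b          # e * X -> X
        if b == 'e':
            return 'idr', a          # X * e -> X
        if a == ('i', b):
            return 'invl', 'e'       # i(X) * X -> e
        if b == ('i', a):
            return 'invr', 'e'       # X * i(X) -> e
    return None

def rewrite_once(t):
    """Apply the first applicable rule anywhere in t (outermost first)."""
    r = root_rewrite(t)
    if r is not None:
        return r
    if isinstance(t, tuple):
        for k in range(1, len(t)):
            r = rewrite_once(t[k])
            if r is not None:
                name, sub = r
                return name, t[:k] + (sub,) + t[k + 1:]
    return None

def simplify(t):
    """Simplify t, recording the sequence of rules used. This trace is
    exactly the kind of proof example from which an outline is learnt."""
    trace = []
    while (r := rewrite_once(t)) is not None:
        name, t = r
        trace.append(name)
    return t, trace
```

For instance, simplifying e * (a * i(a)) fires idl and then invr, reducing the term to e.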
An alternative to this approach is to re-represent our problems, for instance so as to omit the brackets in the presence of the associativity rules. However, this would be a much harder learning problem. In this paper we do not take this approach.
The representation of a problem is of crucial importance for solving it: a good representation of a problem renders the search for its solution easy. This is a well-known piece of advice from Pólya [polyahtsi]. The difficulty is in finding a good representation. Our problem is to devise a mechanism for learning methods. Hence, the representation of a method is important and should make the learning process as easy as possible. Furthermore, it should be possible to represent loop applications of inference rules in methods. Here we present a simple representation formalism for methods, which abstracts away as much information as possible for the learning process, and then restores the necessary information so that the proof planner can use the newly learnt method. At the same time it caters for loop applications of inference rules in a method.
A major problem we are faced with when we want to learn compound methods from lower-level ones is the intrinsic complexity of methods, which goes beyond the complexity that can typically be tackled in the field of machine learning. For this reason we first simplify the problem by trying to learn so-called method outlines (discussed next), and second, we reconstruct the full information by extending outlines to methods using precondition analysis (discussed in Section 5).
Let us assume the following language L, where P is a set of primitives (which are the known identifiers of methods used in a method that is being learnt). In essence, this language defines regular expressions over method identifiers. The weight w defines the complexity of an expression:
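Since the formal definitions of L and of the weight w are omitted here, the following sketch fixes one plausible reading: outlines are built from primitives by sequencing, disjunction, and iteration ("*"); primitives weigh 0 and every operator occurrence adds 1, with an n-ary sequence counting n-1. These conventions are our own assumptions, chosen so that the resulting weights agree with the examples quoted in the next section:

```python
# A method outline is a primitive (a string), a sequence ('seq', o1, ..., on),
# a disjunction ('or', o1, o2), or an iteration ('star', o).

def weight(o):
    """Complexity of an outline: primitives 0, each operator adds 1."""
    if isinstance(o, str):
        return 0
    kind = o[0]
    if kind == 'seq':                  # n-ary sequence counts n-1
        return (len(o) - 2) + sum(weight(x) for x in o[1:])
    if kind == 'or':
        return 1 + weight(o[1]) + weight(o[2])
    if kind == 'star':
        return 1 + weight(o[1])
    raise ValueError(kind)

# [[[a v b] v c] v d]* has weight 4 under these conventions:
outline1 = ('star', ('or', ('or', ('or', 'a', 'b'), 'c'), 'd'))
# [[a,b]*, [b,c]*, [c,d]*] has weight 8:
outline2 = ('seq', ('star', ('seq', 'a', 'b')),
                   ('star', ('seq', 'b', 'c')),
                   ('star', ('seq', 'c', 'd')))
```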
As explained in the previous section, we use our language L for an abstract representation of methods, i.e., method outlines, in order to simplify the representation and render the learning process easier. Here we address how such method outlines can be learnt given a number of well-chosen examples of proofs. Typically, there are many possible method outlines which describe a method that the system is learning. The general idea is to learn the simplest (following Occam's razor principle) and optimal method outline, and we measure simplicity (in our first approach) in terms of weight as defined in Section 3. Notice that our decision to use simplicity to select a possible method outline is a heuristic choice. It may turn out not to be the appropriate choice, that is, we may sometimes prefer not the simplest method outline but perhaps the most general one, or even the least general one. Here, we give an example of the simplest method outline, which is too general given our set of training examples. Consider a set of 4 primitives {a,b,c,d}, and the training examples {a,b,b,c,c,d}, {a,b,a,b,b,c,b,c,c,d,c,d}, {a,b,a,b,a,b,b,c,...}. These examples can be described by the following method outlines: [[[a∨b]∨c]∨d]^{*} and [[a,b]^{*},[b,c]^{*},[c,d]^{*}]. The first outline is of weight 4, whereas the second outline is of weight 8. If only the simplicity heuristic is used, then the first outline is chosen. However, it is too general, since it describes examples which it is not intended to describe, e.g. {a,a,a,a,a}. As part of our future work we may have to refine our notion of simplicity.
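Because method outlines are regular expressions over method identifiers, checking whether an outline describes a training example can be delegated to an ordinary regex engine. The sketch below (our own illustration, with primitives restricted to single letters for brevity) makes the over-generality of the weight-4 outline above checkable:

```python
import re

# An outline is a string primitive, ('seq', o1, ..., on), ('or', o1, o2),
# or ('star', o). Compile it to a regex over method identifiers.
def to_regex(o):
    if isinstance(o, str):
        return re.escape(o)
    if o[0] == 'seq':
        return ''.join(to_regex(x) for x in o[1:])
    if o[0] == 'or':
        return '(?:' + to_regex(o[1]) + '|' + to_regex(o[2]) + ')'
    if o[0] == 'star':
        return '(?:' + to_regex(o[1]) + ')*'
    raise ValueError(o)

def describes(outline, example):
    """Does the outline match the whole training example (a list of ids)?"""
    return re.fullmatch(to_regex(outline), ''.join(example)) is not None

# [[[a v b] v c] v d]* also describes the unintended example {a,a,a,a,a} ...
general = ('star', ('or', ('or', ('or', 'a', 'b'), 'c'), 'd'))
# ... while [[a,b]*, [b,c]*, [c,d]*] rejects it.
specific = ('seq', ('star', ('seq', 'a', 'b')),
                   ('star', ('seq', 'b', 'c')),
                   ('star', ('seq', 'c', 'd')))
```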
We discuss three possible approaches to learning: a complete generation of method outlines, a guided generation of method outlines, and a technique similar to least general generalisation (see [generalize1,generalize2]).
Our first attempt at a learning technique is to generate all possible method outlines in language L of a certain weight. We then prune this set by eliminating the method outlines which cannot be instantiated to the representation of the examples. This technique is similar to breadth-first search in that the solution is guaranteed to be found and to be optimal. Also similarly to breadth-first search, the technique is very inefficient, and may turn out to be unusable in practice for complex method outlines. Clearly, the size of the set of all possibilities depends on the number of primitives and on the maximum weight of the method outline representing a possible method description. For example, if we have 3 primitives {m_{1},m_{2},m_{3}}, then the size of the set of all possible method outlines of weight 0 is 3, the size of the set of all possible method outlines of weight 1 is 15, etc. The 15 terms of weight 1 are: [m_{i},m_{j}] for i,j in {1,2,3}, [m_{1}∨m_{2}], [m_{2}∨m_{3}], [m_{1}∨m_{3}], [m_{1}^{*}], [m_{2}^{*}], [m_{3}^{*}]. The candidates [m_{i}∨m_{i}] and [m_{2}∨m_{1}], [m_{3}∨m_{2}], [m_{3}∨m_{1}] are filtered out due to the idempotency and commutativity of "∨". If the examples of simplification included only {invr,idr} and {invr,idl}, then the only possible method outline of weight 2 representing a generalised method outline would be [invr,[idr∨idl]], which is the same as [invr,[idl∨idr]] (disjunction is commutative, but a sequence is not). Note also that we exclude method outlines such as [invr^{*},[idr∨idl]], [invr,[idr^{*}∨idl]], [invr,[idr∨idl^{*}]], [invr,[idr∨idl]^{*}] or any combination of these starred subexpressions: they are not as simple as possible.
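The complete-generation approach can be sketched as follows. As before, the weight conventions (primitives 0, each operator 1) and the canonical ordering used to discard ∨-idempotent and ∨-commutative duplicates are our own assumptions; for 3 primitives the sketch reproduces the counts quoted above (3 outlines of weight 0, 15 of weight 1):

```python
def outlines_of_weight(prims, w):
    """All method outlines of weight exactly w over the given primitives,
    modulo idempotency and commutativity of disjunction."""
    if w == 0:
        return set(prims)
    result = set()
    # binary sequences and disjunctions splitting the remaining weight
    for v in range(w):
        for a in outlines_of_weight(prims, v):
            for b in outlines_of_weight(prims, w - 1 - v):
                result.add(('seq', a, b))
                if repr(a) < repr(b):         # canonical order keeps [a v b],
                    result.add(('or', a, b))  # drops [b v a] and [a v a]
    # starred outlines of the remaining weight
    for a in outlines_of_weight(prims, w - 1):
        result.add(('star', a))
    return result
```

A learner would generate these sets for increasing w and keep the first outline that can be instantiated to all training examples, guaranteeing a minimal-weight solution at the cost of an explosive search space.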
The size of the set of possible method outlines increases hyper-exponentially, which poses a severe problem. For example, using only 4 primitives it is possible to generate all method outlines of weight 6, but method outlines of weight 7 are already beyond reasonable computer resources. We can improve upon this performance by generating only the relevant method outlines; this is our second attempt at a learning technique. For instance, given any example, we could avoid generating all those method outlines that cannot be part of the method outline to be learnt (for instance, if the sequence invr,invr does not occur in any of the training examples, then no method outline containing this sequence should be generated). This prunes the possibilities dramatically. Such an approach can be justified as a valid heuristic because we do not expect the methods to be so complex (of such large weight) that they are intractable, but rather as simple as possible. In fact, we could argue that if a method needs to have a complexity beyond a certain weight, then it is better to first learn several less complex methods that comprise it, and then put these together into a larger method which is then of smaller complexity.
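The guided variant can be sketched by computing which pairs of adjacent method identifiers actually occur in the training examples, and refusing to build any sequence that would force an unobserved pair, as in the invr,invr example above. The first/last-symbol computation below deliberately ignores the complications introduced by nullable (starred) subexpressions; all names are our own illustrative choices:

```python
def observed_bigrams(examples):
    """Adjacent identifier pairs occurring in any training example."""
    return {(e[i], e[i + 1]) for e in examples for i in range(len(e) - 1)}

def firsts(o):
    """Identifiers that can start a match of outline o (stars simplified)."""
    if isinstance(o, str):
        return {o}
    if o[0] == 'seq':
        return firsts(o[1])
    if o[0] == 'or':
        return firsts(o[1]) | firsts(o[2])
    return firsts(o[1])                      # 'star'

def lasts(o):
    """Identifiers that can end a match of outline o (stars simplified)."""
    if isinstance(o, str):
        return {o}
    if o[0] == 'seq':
        return lasts(o[-1])
    if o[0] == 'or':
        return lasts(o[1]) | lasts(o[2])
    return lasts(o[1])                       # 'star'

def seq_allowed(a, b, bigrams):
    """May outline a be followed by outline b, given the observed bigrams?"""
    return any((x, y) in bigrams for x in lasts(a) for y in firsts(b))
```

During generation, a candidate ('seq', a, b) is only built when seq_allowed(a, b, bigrams) holds, which rules out sequences such as invr,invr whenever that pair never occurs in the examples.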
Although the two learning mechanisms discussed above are rather inefficient, they form a standard which can be used to measure the quality of a solution found by a procedure which is computationally more efficient.
A third possible approach for a learning technique is similar to Plotkin's least general generalisation [generalize1,generalize2]. For more detail see the full version of this paper [CSRP:JaKeBe00].
Methods expressed using the language introduced in Section 3 do not specify what their preconditions and postconditions are. They also do not specify how the number of loop applications of inference rules is instantiated when the method is used to prove a theorem. Hence, the method representation needs to be enriched to account for these factors. We propose to use ideas from precondition analysis, developed by Silver [pub220] and later extended by Desimone [pub321], in order to enrich our method representation.
The idea of precondition analysis is to examine the reasons for applying each inference rule at each step of the proof. This is achieved by providing explanations for each step in the proof, which are usually extracted from the preconditions and postconditions of a step. The preconditions of each rule used in a method are paired with additional information, namely the methods that generated these preconditions. Similarly, the postconditions of each rule used in a method are paired with the methods that use these postconditions. We extend Desimone's method schema representation with the effects that an inference rule has in the proof. Effects are used to express a change in the proof planning state which is not explicitly planned for because, for instance, the underlying language may not be rich enough to express these changes. We demonstrate the representation of a method schema with an example.
Let there be a proof of a theorem T which consists of three steps, M_{1}, M_{2} and M_{3}. These methods consist of preconditions, postconditions, effects and tactics as demonstrated in the table below:
        Preconditions   Postconditions   Effects    Tactic
M_{1}   x & y & p(d)    x & y & z        w = f(d)   t_{1}
M_{2}   y & p(w)        m                           t_{2}
M_{3}   z & m           n                           t_{3}
Using the language for method outlines described in Section 3, we are able to re-represent methods in the way suggested by precondition analysis. In our example above, the method schema, say M_{4}, which is [M_{1},M_{2},M_{3}] in our language L, can be re-represented as a method as given in Figure 2:
Pre     x & y
Tactic  [M_{1},M_{2},M_{3}]
Post    n
Analogously, a method outline with a disjunction, such as [M_{1}∨M_{2}], can be represented as a method schema whose precondition is the disjunction of the preconditions of M_{1} and M_{2}. Similarly, the postcondition of this method schema is the disjunction of the postconditions of M_{1} and M_{2}. The planner has to be able to handle disjunctions of pre- and postconditions, which is a nontrivial open problem in proof planning. Extending a planner to deal with such disjunctions needs to be addressed in detail in the future. The second argument of the pre- and postcondition pairs is determined as explained above. We further extend Desimone's method schema representation with a disjunction of explanations for method applications. Namely, if there is more than one method which generates/uses a pre-/postcondition, then these are combined disjunctively. This allows us to encode the fact that a postcondition of a particular rule is also a precondition of the same rule, and hence that the rule can be applied several times.
Depending on the example that the method is used to prove, the pre- and postcondition pairs are instantiated differently. This determines whether and how many repeated applications of a rule are needed. Hence, a method outline with a "*", such as [M_{1}^{*},M_{2}], has pre(x,_ ∨ M_{1}), pre(y,_ ∨ M_{1}) as preconditions of M_{1}, and post(x,_ ∨ M_{1}), post(y,M_{1} ∨ M_{2}), post(z,_) as postconditions of M_{1} (the rest is as expected). Hence, if after an application of M_{1} all the postconditions of M_{1} satisfy its preconditions, and furthermore no other method in the tactic is applicable, then M_{1} is applied again. If its postconditions satisfy the preconditions of another method, say M_{2}, then M_{1} is no longer applied in the proof. This is of course a heuristic decision, and further research will have to examine whether it is appropriate.
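The reapplication heuristic just described can be phrased operationally: keep applying M_{1} while its own preconditions still hold and the successor method is not yet applicable. In the sketch below the planning state and the two methods are modelled abstractly (as a dictionary and as precondition-predicate/transformer pairs); this modelling is our own illustration:

```python
def run_star_then(m1, m2, state, max_iter=100):
    """Execute the outline [M1*, M2] under the reapplication heuristic."""
    (pre1, apply1), (pre2, apply2) = m1, m2
    trace = []
    # reapply M1 while it is applicable and M2 is not yet applicable
    while pre1(state) and not pre2(state) and max_iter > 0:
        state = apply1(state)
        trace.append('M1')
        max_iter -= 1
    if pre2(state):
        state = apply2(state)
        trace.append('M2')
    return state, trace

# e.g. M1 flattens one level of nesting; M2 fires once the term is flat:
M1 = (lambda s: s['depth'] > 1, lambda s: {'depth': s['depth'] - 1})
M2 = (lambda s: s['depth'] == 1, lambda s: {'depth': 0})
```

Starting from a state of depth 3, M1 is applied twice before M2 becomes applicable, so the number of loop iterations is instantiated by the example rather than fixed in the method.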
Now we can represent the method outline [[assocl∨assocr]^{*},[invr∨invl],[idl∨idr]] as a simplification method using pairs of pre- and postconditions with the methods that generate or use them, and the effects that the rules have. To save space, but still convey the main points, we only consider a simplified version of simplification, namely the method outline [assocl^{*},[invr∨invl],idl]. Notice that in order to be able to attach explanations to the inference rules in the style of precondition analysis, the method language needs to be extended. We extend it with the following vocabulary: subt(X,Y) for "X is a subterm of Y"; nb(X,Y) for "X is a neighbour of Y" (two subterms of a term are neighbours if they are listed one after another when a tree representing the term is traversed in postorder); delta(X,Y,Z) for "the distance between subterms X and Y in a term Z" (the distance between two subterms is the number of nodes between them when a tree representing the term is traversed in postorder, decreased by 1); and red(E_{1},E_{2}) for the fact that an expression E_{1} is reduced to E_{2} (an expression is reduced when it consists of fewer subterms than originally). The choice of this vocabulary is not important for this paper, and needs to be discussed elsewhere. Figure 3 shows the relevant inference rules augmented with the appropriate explanations.
[Figure 3: the inference rules (assocr), (idl), (invr), (invl), each annotated with its preconditions, postconditions and effects.]
Figure 4 gives a method schema representation for the method outline [assocl^{*},[invr∨invl],idl]. Notice in Figure 4 that in the application of (assocl), A_{1} matches with A and A_{2} matches with A^{i}, or A_{1} matches with A^{i} and A_{2} matches with A. Note that applicable(x) means that all the preconditions of x are satisfied.
Pre     nb(A,A^{i},E) & (applicable(assocl) ∨ applicable(invr) ∨ applicable(invl))
Tactic
Post    red(E,E_{i})
In this paper we introduced a language for method outlines which can be used to describe compound proof methods in proof planning at an abstract level. These methods can carry out loop applications of less complex methods and can apply them disjunctively, depending on the theorem they are used to prove. We also introduced a technique for learning optimal method outlines from a number of examples of method applications. This technique is inefficient, hence we hinted at how a more sophisticated approximative technique could be devised.
The method outlines of the introduced language can be encoded as proof method schemas in the style of Silver and Desimone in their work on precondition analysis. We demonstrated how a method outline learnt from a number of examples of the simplification method can be represented as a method schema.
Our approach is restricted to learning new higher-level proof methods on the basis of already given ones. We cannot learn language extensions, such as a coloured term language, which would be a prerequisite for learning any kind of method similar to rippling [pub567]. In this paper we do not address the question of how such a vocabulary can be learnt by a machine. Work by Furse on the MU learner [furse94] may be relevant for this task.
Not much work has been done in the past on applying machine learning techniques to theorem proving, and in particular to proof planning. We already mentioned the work by Silver [pub220] and Desimone [pub321], who used precondition analysis to learn new method schemas; we explained how we use their ideas in our work. Also of interest is work on generalisation [generalize2], and other machine learning techniques such as inductive logic programming [muggleton94] and explanation-based generalisation [mitchellebg].
Finally, there are many open questions that remain to be worked out.
We would like to thank Alan Bundy for his continuing interest in and advice on our work, and in particular for pointing us to the work of Silver and Desimone. Furthermore, we would like to thank Andreas Meier and Volker Sorge for their invaluable help in getting started with realising our ideas in Omega. Finally, we would like to thank our referees for many useful comments. This work was supported by EPSRC grants GR/M22031 and GR/M99644.