A grammar and lexicon
To keep things simple, we'll use the same grammar and lexicon that we developed parser the sentence they are flying planes. By "the same grammar and lexicon" I mean that we will use exactly the same, in the same form. In another section we used Prolog's Definite Clause Grammar (DCG) formalism to write a syntactic description that is directly executable in Prolog.
It is very convenient to be able to run our descriptions directly in Prolog, but it apparently limits us to a top-down, depth-first search strategy. If we want a different strategy, we have to write a specification of the strategy in Prolog code. The important thing to realise is that we don't need to use a different way of writing the rules: we can still write them in the form of a DCG. Our program has to be able to use knowledge written in the form of a DCG and parse appropriately.
There is an added advantage in using DCGs. Let us suppose (and seems very reasonable to do so), that for any DCG that will terminate when run (ie doesn't contain an infinite branch in the search tree), a Prolog compiler will give us all the possible solutions and in the correct order. We are going to write a program that will work either top-down depth-first (as do DCGs running directly within Prolog), or top-down breadth-first. We can compare the results we get with our depth-first version with those given by Prolog so as to obtain evidence that our program works correctly.
This will help us with the depth-first control, but not necessarily with the breadth-first search. However, the situation is not dire. We have already stated in another section that breadth-first and depth-first searches give the same results (providing there are no infinite branches in the search tree). (We would know if there were an infinite branch in the search tree for the grammar we've used, because when the DCGs were run directly in Prolog, the computation would never halt.) So we can be partly reassured that our version of top-down breadth-first search is correct if we obtain the same set of phrase-structure trees as for the depth-first version. As was stated elsewhere, breadth-first search will give the solution with the least number of "steps" in the computation before those with a greater number. In a straight-forward context-free grammar, such as the one used here, this means that a phrase-structure tree with fewer nodes would be produced before one with a greater number of nodes.
We will return to this topic when talking about the testing of the parser.
Data structure
The algorithm described here uses a very common AI technique: that of the agenda. In the context of AI, an agenda is just a store of things to do. It is important to realise that the agenda will hold information about every stage in the search: not just where there is ambiguity.
The difference between depth-first and breadth-first search was described as being simply the difference between a stack and a queue, so our agenda will have to be able to behave as either a stack or a queue.
In a sense, we are getting beyond ourselves. We need to decide first what information we must hold in each entry in an agenda. Fortunately the use of Prolog with its in-built unification of variables makes the design of the agenda very much simpler. In theory, we need record only one thing in an agenda entry: that is, a rule (which of course, has both a head and a body). This would be very concise, but we will use a more verbose representation in the belief that it is easier to follow.
One easy way to keep this information is to store in the agenda the top-level goal used to start each parse. In the example grammar we're using it would look like:
s(s(_21889,_21891),[they,are,flying,planes],[])Notice here how the two last arguments of the DCG are displayed explicitly within the agenda entry. The first argument (s(_21889,_21891)) represents the phrase-structure tree that will be built during the parse.
(np(_21889,[flying,planes],_21890),vp(_21891,_21890,[]))This may look unusual because it is a conjunction: that is, two subgoals that have to both be proved. Quite how the subgoals came to be in this form is a story for a later section. This example was taken from the same agenda entry as that for the top-level goal. Note the internal variables that Prolog has assigned to the two structures:
In summary we can describe the each agenda entry as:
agenda_entry(Sub_Goals, Goal, Words, Words0).
An example would be:
agenda_entry(np(_21889,[flying,planes],_21890),vp(_21891,_21890,[]), s(s(_21889,_21891),[flying,planes],[]), [flying,planes], [])
Having designed the individual entries in the agenda, we can look at the data structure for the whole agenda. As stated above, it will be either a stack or a queue. We want to be able to choose which at run-time and we don't know how many entries will be on the agenda at any given time. Both these factors point to the use of a list as the method of implementation. We can then use the standard definition of append/3 to add entries to the agenda.
In summary, the agenda will be a list of agenda_entry/4 structured objects, manipulated as either a stack or queue, according to the user's choice.
Retrieving and executing goals: some Prolog predicates
To retrieve grammar rules and lexical entries from the Prolog database, we use clause/2. To execute goals from within the parser, we use call/1. If you know how these two in-built predicates work, then move onto the Outline design section. Otherwise, work through the material on the use of clause/2 and call/1 in this parser before going on.
Outline design
The essential idea of the program is that it runs all the time the agenda has one or more entries, but halts when there are no more entries on the agenda. We also need to have a variable that is used to store the kind of control (depth-first or breadth-first) we wish to use. We can sketch the outline program as:
% 1 boundary - agenda is empty
parse([], _) :-
write('Parse completed.'), nl.
% 2 recursive - work on first member of the agenda
parse([Ag_Head|Ag_Rest], Control) :-
% find all rewrites of the first category of the
% right-hand side
% add all new agenda entries to Agenda, either depth-first
% or breadth-first
% recurse on the new agenda
parse(New_Agenda, Control).
From this, we'll expand the section:
% add all new agenda entries to Agenda, either depth-first % or breadth-firstRemember that the agenda is a list that will be used either as a queue or a stack, and that we have the variable Control that we'll use to decide whether to put new entries at the beginning of the list (a stack) or at the end (a queue). Adding entries to a list is simply a process of appending and we can use the standard definition of append/3.
What arguments will we need?
The two possible instantiations of the first (control) variable suggest that we need a minimum of two clauses in this procedure. There is nothing in the description of the other variables that suggest that we need any more alternative clauses. The code for adding entries to the agenda is:
% note order of arguments in append/3 subgoal! % 1 - depth-first add_to_agenda(d, Add_to_Agenda, Old_Agenda, New_Agenda) :- append(Add_to_Agenda, Old_Agenda, New_Agenda). % 2 - breadth-first add_to_agenda(b, Add_to_Agenda, Old_Agenda, New_Agenda) :- append(Old_Agenda, Add_to_Agenda, New_Agenda).
Having done this, we can modify the previous version of the outline program, as follows:
% 1 boundary - agenda is empty
parse([], _) :-
write('Parse completed.'), nl.
% 2 recursive - work on first member of the agenda
parse([Ag_Head|Ag_Rest], Control) :-
% find all rewrites of the first category of the
% right-hand side
% add all new agenda entries to Agenda, either depth-first
% or breadth-first
add_to_agenda(Control, New_Ag_Membs, Ag_Rest, New_Agenda),
% recurse on the new agenda
parse(New_Agenda, Control).
We now have to expand the section we've described as:
% find all rewrites of the first category of the % right-hand sideThis means that if we have a rule:
s -> np vpwe have to find all possible ways in which np could be rewritten, according to the grammar. This will involve us in some quite intricate detective work. Before we launch into this, let's lay a bit of the ground work by considering what information we have at our disposal and what information we want to get out of this process. We want to produce a list of new entries to add to the agenda (New_Ag_Membs). Our input will be the item at the head of the agenda (Ag_Head).
The detective work comes in deciding in what Ag_Head will look like. We need this information because we must write a specification of what we want to happen to each and every form of the agenda_entry. We'll start with what we want to do with leaves of trees.
First we have to know what Prolog does with the leaves of DCGs. Remember that Prolog converts something like:
np --> adj, noun.into:
np(S, S0) :- adj(S, S1), noun(S1, S0).What does Prolog do with lexical entries such as?
adj --> [flying].It gets transformed into:
adj(S, S0) :- 'C'(S, they, S0).The body of this rule looks strange, because every good Prolog programmer has been brought up never to use uppercase letters (even in quotes) as the name of structures, rules and facts. 'C'/3 is an in-built Prolog procedure that can be described as:
'C'([A|B], A, B).In English, this ensures that the word in the dictionary is the same as the head of the input (S), difference list).
For our present purposes, it means that we have to specify what happens every time our parser finds a 'C'/3 goal. We'll look first as what happens when we're processing the very last word in a sentence. When we've got to the end of the sentence, there will only be this lexical entry left in the agenda (because there is nothing more to rewrite). What we do is to write ourselves a rule that does the same as the 'C'/3 fact described above. Our rule looks like:
% 1 - last LE matches - so finish
get_rest(agenda_entry('C'([S|S0], S, S0), Original_Call, [S|S0], S0), []) :-
write('Successful parse: '),
write(Original_Call), nl.
Now here comes the difficult bit. Generally in Prolog, we only specify the things that are "true" and let Prolog fail on the things that we're not interested in. Agenda-driven software relies on not failing at all, but letting the computation finish when the agenda is empty. So we have to specify what the program should do when the previous condition doesn't match. In this case only, it is complicated. We'll refer to this example:
agenda_entry('C'([S|S0], S1, S2), Original_Call, [S|S0], S2)
There are two reasons why condition "% 1" might fail:
% 2 - last LE doesn't match - for some reason or other
get_rest(agenda_entry('C'([S|S0], S1, S2), Original_Call, [S|S0], S2), []) :-
(\+ (S = S1), \+ (S0 = S2)) ; % condition 1 & 2
(\+ (S = S1), S0 = S2) ; % condition 1 only
(S = S1, \+ (S0 = S2)). % condition 2 only
When the agenda contains an entry for a leaf that isn't the last ("right-most") leaf in a tree, the agenda entry will always be in the form of a conjunction: (Leaf, Rest). I opted for a simpler way of programming the conditions than for clauses % 1 and % 2, by calling the 'C'/3 structure as the argument of call/1. This is the condition for success:
% 3 - LE - but not the last - matches
get_rest(agenda_entry(('C'([S|S0], S2, S0), Rest), Original_Call, [S|S0], S1),
[agenda_entry(Rest, Original_Call, S0, S1)]) :-
call('C'([S|S0], S2, S0)).
Note here that the result of this success is that the new agenda entry is the same as the old agenda entry, but without the 'C'/3 structure.
As with % 2, we have to specify what should happen when the success condition doesn't work. This time, we can take a reasonable short cut. The failure condition will hold when the body of the rule isn't true:
% 4 - LE - but not the last - doesn't match
get_rest(agenda_entry(('C'([S|S0], S2, S0), Rest), Original_Call, [S|S0], S1),
[]) :-
\+ (call('C'([S|S0], S2, S0))).
Notice here that there isn't an entry in the new agenda list.
The remaining three clauses specify what is to be done with non-terminals, but before we describe each one in turn, we'll look at what the tree conditions they deal with are:
For the single goal, all we need do is to "findall" ways of rewriting the goal (which we've called First). We use clause/2 to find all possible rewrites.
% 5 - success - first item is a non-lexical item get_rest(agenda_entry(First, Original_Call, Words, Words0), New_Ag_Membs) :- \+ (First = (_,_)), % ensure First isn't a conjunction \+ functor(First, 'C', _), % ensure First isn't a lexical entry % find all phrase structure rules that expand symbol findall(agenda_entry(RHS, Original_Call, Words, Words0), clause(First, RHS), New_Ag_Membs).
For the conjunction, we use much the same code as for % 5 (single goal), except the new agenda entry must include the second part of the conjunction (referenced by the variable Rest).
% 6 - success - first item is a non-lexical item get_rest(agenda_entry((First, Rest), Original_Call, Words, Words0), New_Ag_Membs) :- \+ (First = (_,_)), % ensure First isn't a conjunction \+ functor(First, 'C', _), % ensure First isn't a lexical entry % find all phrase structure rules that expand symbol findall(agenda_entry((RHS, Rest), Original_Call, Words, Words0), clause(First, RHS), New_Ag_Membs).
For the nested conjunction, we have to be a bit more subtle. Here we have to "look inside" the conjunction until we find something we can either rewrite (if it is a non-terminal) or eliminate (if it is a terminal). Providing we've set up the previous six conditions correctly, this is quite easy, because it is simply a process of recursively taking the first goal of a conjunction until one of the first six previous get_rest/2 clauses apply. Note how the new agenda is built up recursively as well.
% 7 - success - first item is a non-lexical item but is a nested conjunction
% - first argument ("(First, Second)") of first argument
% ("((First, Second), Rest)") is a conjunction
get_rest(agenda_entry(((First, Second), Rest), Original_Call, Words, Words0),
New_Ag_Membs) :-
% find all phrase structure rules that expand symbol
findall(agenda_entry((RHS, Rest), Original_Call, Words1, Words0),
(get_rest(agenda_entry((First, Second), Original_Call, Words, Words0),
New_Ag_Membs1),
member(agenda_entry(RHS, Original_Call, Words1, Words0),
New_Ag_Membs1)),
New_Ag_Membs).
Now we can complete the original outline of the program:
% 1 boundary - agenda is empty
parse([], _) :-
write('Parse completed.'), nl.
% 2 recursive - work on first member of the agenda
parse([Ag_Head|Ag_Rest], Control) :-
% find all rewrites of the first category of the
% right-hand side
get_rest(Ag_Head, New_Ag_Membs),
% add all new agenda entries to Agenda, either depth-first
% or breadth-first
add_to_agenda(Control, New_Ag_Membs, Ag_Rest, New_Agenda),
% recurse on the new agenda
parse(New_Agenda, Control).
and add a piece of code that will initialise the agenda by finding all possible rewritings of DCG rules for the distinguished symbol:
parse(Words, Remainder, Dist_Symbol, Arity, Control) :- % make a structured object that has the Dist_Symbol as its % functor and has the number of arguments given by Arity. functor(Dist_Symb_Head, Dist_Symbol, Arity), % make second to last argument the Words to be parsed Words_to_Consume is Arity - 1, arg(Words_to_Consume, Dist_Symb_Head, Words), arg(Arity, Dist_Symb_Head, Remainder), % find all DCG rules that have the Distinguished_Symbol % as their LHS and add them to the agenda with their RHS (Body) findall(agenda_entry(Body, Dist_Symb_Head, Words, Remainder), clause(Dist_Symb_Head, Body), Agenda), % then parse from the Agenda parse(Agenda, Control).This looks complicated, but is just a generalised description that will allow the use with DCGs with any number of arguments, assuming that S and S0 (the difference lists for words) will always be the last two arguments.
Assuming that the DCG used in the description has been loaded, the program given above can be run with queries such as:
| ?- parse([they,are,flying,planes],[], s, 3, d). | ?- parse([flying,planes],[], np, 3, b). | ?- parse([they,are,flying,planes],[are,flying,planes], np, 3, d).
Using the parser
The parser given in the previous sections will work, but I've prepared a longer version that includes some tracing information, so you can see how the agenda changes during the parsing. Copy this program into your user area and load it into Prolog so you can see it at work.
This is called with much the same query as the previous version, except there is one further argument to turn the display of the agenda on (y) and off (n).
| ?- parse([they,are,flying,planes],[], s, 3, d, y). | ?- parse([flying,planes],[], np, 3, b, y). | ?- parse([they,are,flying,planes],[are,flying,planes], np, 3, d, n).
Use this program to investigate the following:
Testing
How do I know that my example program is correct? The quick answer is that I don't know for certain. I am confident, but not certain. My reasons for being confident are as follows:
I could have done more testing. If this were the core parser in a research system, I would have probably created example test queries for each procedure, showing which clause should match with which query and showing the variable bindings. I could even have attempted to run the DCG version in a breadth-first Prolog.
Efficiency
If you've tried running the depth-first/breadth-first parser in tandem with the DCG directly running in Prolog, you may have noticed that the former is slower (especially on older micros). This is partly because the depth-first/breadth-first parser uses an explicit data structure and partly because it carries around more arguments than its DCG counterpart. There are undoubtedly things that could be changed in the program to make it quicker, for instance some parts could be optimised to take account of efficiency features in Prolog compilers, and the agenda could certainly be changed to something somewhat simpler. The simpler version would have been "more concise" which is often another way of saying "less intelligible".
This parser was designed essentially for pedagogic purposes and efficiency wasn't the prime concern.
A brief note on the meaning of breadth-first and depth-first control
This parser uses a rather restricted interpretation of choice. Consider the following Prolog program:
a :- b, c. a :- d. b :- e. c. d. e.As with the parser, there is a choice of clauses in the procedure a/0. We can put both clauses onto the agenda, and then process them either depth-first or breadth-first. But there is another dimension of choice: the choice of subgoals. When we use the rule:
a :- b, c.we can work on the first subgoal (b) until we have proved it, then move onto the second subgoal (c). In essence, we are proving the subgoals in a depth-first way. But there is an alternative: we can take one "step" toward solving b, then one step toward solving "c", then back to the solving of "b", then onto "c", and so on until we have no more subgoals to prove. In effect, we're processing the subgoals in a breadth-first way.
If we were writing a depth-first/breadth-first Prolog, we would have to allow the user to specify if they wanted:
Clause choice Depth-first Breadth-first Sub-goal Depth-first df/df df/bf choice Breadth-first bf/df bf/bf
From this simple table, we can see we have four possible Prolog interpreters (of which "normal" Prolog is "df/df".
If we were writing search algorithms for many areas of AI, we would have to choose between the different configurations, but in syntactic parsing we are truly lucky. For many languages (of which English is one), the order of words ("from left-to-right") makes it less than practical to have anything other than "df/df" and "bf/df" algorithms. We just would be denying ourselves the useful knowledge of where one syntactic constituent ends and the next starts. It might be interesting to try out the "df/bf" and "bf/bf" configurations on languages that have free word order (like Hellenic Greek), but that is something for a project.