Search and control in parsing

Using the grammar and lexicon presented in the section on the terminology of syntax and context-free grammars, we show how phrase-structure trees can be developed top-down or bottom-up. This is presented as search. Using the sentence they are flying planes, we show that the search can be controlled either depth-first or breadth-first.

Contents


Search strategies

The grammar presented in the section on the terminology of syntax and context-free grammars was:
s -> np vp

vp -> verb_group np
vp -> verb np

verb_group -> aux verb

np -> pronoun
np -> noun
np -> adj noun

The lexicon was:

adj -> flying

aux -> are

noun -> planes

pronoun -> they

verb -> are
verb -> flying

We have seen that we can derive phrase-structure trees from the grammar for the sentence they are flying planes. In this section we will describe how to search for solutions in the form of phrase-structure trees. There are two main search strategies: top-down and bottom-up.


The top-down search strategy

The top-down search strategy is sometimes known as hypothesis-driven search, because it essentially operates by proposing that the input string (in this example, they are flying planes), is covered by the distinguished symbol of the grammar. In the example given above, we would start by proposing an S as the distinguished symbol:

We can then draw in the non-terminals that are dominated by S:

We then pick one of the dangling non-terminals and expand it:

We then repeat the process by expanding another non-terminal:

and another:

until we have completed the expansion of all non-terminals.


The bottom-up search strategy

The bottom-up search strategy is sometimes known as data-driven search, because it essentially operates by building upwards from the input string (in this example, they are flying planes), to the distinguished symbol of the grammar. This example is rather an informal description:

We can then try to add a non-terminal that dominates one or more "lower" nodes:

We then repeat the process:

and again:

and again:

until we can add the distinguished symbol to the root of the tree:


Control of the search strategy

As we saw in another section, there is more than one phrase-structure tree that may be derived from the grammar when applied to they are flying planes. It is likely that we will want our NLP systems to find all trees of globally ambiguous sentences such as this. (We may want to choose between the trees on the basis of some other kind of information, such as semantic information.) When we're writing algorithms, we have to have a method of ensuring that we've considered every possibility, that is, we have to be sure that the output of the parser is complete.

Control is the name we give to the process of handling alternatives in search. There are essentially two kinds of control: depth-first control and breadth-first control.


Depth-first control of the search strategy

Control is the process of handling alternatives in search: depth-first control pursues one alternative as far as possible until it is successful or blocks. Only then does it consider the next alternative. This is the control used in usual implementations of Prolog and it is often associated with the term chronological backtracking.

To give an example, consider the top-down search when it has got as far as:

There are two VP rules which can be used:

vp -> verb_group np
vp -> verb np
In depth-first control, the alternatives are placed on a stack and the first item popped from the stack. This is expanded by placing alternatives on the stack. So our stack would look like:
verb_group -> aux verb [+ np]
vp -> verb np

When the aux and verb have been consumed, the alternatives rules for np are placed on the stack:

np -> pronoun
np -> noun
np -> adj noun
vp -> verb np

Each alternative would be popped from the stack in turn, until there only remains:

vp -> verb np

It is only now that this alternative VP rule can be used.


Breadth-first control of the search strategy

Breadth-first control uses a queue rather than a stack data structure. If we look again at the handling of the alternative VP rules, we'll see the difference.

There are two VP rules which can be used:

vp -> verb_group np
vp -> verb np

The first entry is removed from the queue and expanded, but the new entry goes at the end of the queue:

vp -> verb np
verb_group -> aux verb [+ np]

This means that the next entry to be expanded is the alternative VP rule, not the verb_group. As an exercise, work out the order in which top-down depth-first and top-down breadth-first search algorithms would find the two trees for they are flying planes, and then compare your solution with mine.


An evaluation of depth-first and breadth-first control

There are some misconceptions about depth-first and breadth-first control. With one important exception, they produce the same results, although not in the same order. If there are two solutions, one involving more "steps" than the other, then breadth-first search will find the solution with fewer "steps" before the solution with more "steps". The order in which depth-first search will find solutions is dependent on the order of rules in the grammar.

The important exception is where the search never halts because there is an infinite branch in the "search tree". Consider a situation where there is one solution and an infinite branch in the search tree. Breadth-first search will find the solution before disappearing into infinity (because the solution must have fewer "steps" than the infinite branch). Depth-first search may find the solution before disappearing into infinity, but only if the rules in the grammar are ordered so that the "finite" solution is found before the infinite branch.

In a situation where there are no infinite branches, which control is preferable? There is no obvious answer to this question. Depth-first used to be the only practical option because (as we'll see when we implement flexible control) on average there are fewer entries on the stack at any one time. When computers were very restricted as to the amount of memory available, this was an important criterion. These days, memory is very much larger and so doesn't affect the choice of control to the same extent. Perhaps some people would claim that there is psychological reality in one control method rather than another, but this a debatable point.


Summary

We have shown that it is possible to have either a top-down or bottom-up search strategy, depending on whether we start at the distinguished symbol (hypothesis) or the words (data). We've also shown that it is possible to have either depth-first control or breadth-first control. Notice that we've kept search strategy and control completely separate. The implication of this is that we can have four kinds of search algorithm:

  1. Top-down, depth-first
  2. Top-down, breadth-first
  3. Bottom-up, depth-first
  4. Bottom-up, breadth-first


© P.J.Hancox@bham.ac.uk