Syntax has a terminology of its own which includes words like terminals, pre-terminals and non-terminals. This section introduces these terms, and includes some working definitions of more common words like noun, verb, preposition, determiner, etc.
Context-free grammars are a method of describing language (and perhaps many other hierarchical things, such as menus, file structures, etc). They are introduced, showing how they are related to phrase structure trees.
Overcoming the limitations of Finite State Automata
We saw in another section that Finite State Automata are mathematically inadequate for the description of natural languages that include centre-embedding. We'll start by considering a way of writing such descriptions.
The example required that our recogniser must only accept input in the form: cªdeª, where a > 0. We know that a valid sequence would be:
cdeNow, we'll stretch our notation a little. We'll let lower-case letters represent the "words" of the language (ie c, d, e). We'll put upper-case letters into our description when we want them to refer to something - but we don't quite know what it will be yet!
Confused? Look at this example.
cDeTo start-off, we'll make D stand in place of d. We'll write this rule as follows:
D -> dThis means that the previous sequence: cDe can now stand for cde.
So far we haven't gained anything, but made life more complicated for ourselves. Now we'll get something extra out of this complication. Let's use D to represent alternatives. We'll let it stand for either: d or (and this is the important one) cDe. We can write this as:
D -> dWhenever we see the symbol D we can rewrite it in either of these two ways.
D -> cDe
Just to make things more elegant, we'll force the very first example (cDe) into the same kind of form:
S -> cDeSuppose we start with the single symbol S. We can rewrite this by substituting the right-hand side of the arrow for the left-hand side. For instance:
S {start}
cDe {use the right-hand side of the S rewrite rule}
ccDee {replace D in the previous line with the
right-hand side of the second D rewrite rule}
cccDeee {replace D in the previous line with the
right-hand side of the second D rewrite rule}
cccdeee {replace D in the previous line with the
right-hand side of the first D rewrite rule}
(You may wonder why I used the second D rewrite rule twice and then abruptly used the first. The answer is simple: there was no algorithm to it; the example had simply become long enough to make the point I wanted.)
This rather bald way of simply setting down rewritings, line-by-line is only one method of writing derivations. A far more attractive and usual method is to draw a tree that shows (at each level) the rewrite that has taken place. We can think of each rule above as being a specification of part of a tree. So the rule:
S -> cDeis just another way of representing:
The tree for the full rewrite would look like:
D -> d
Thus far the example has been rather abstract, as it has been based on sequences of symbols which bear almost no relation to any natural language. We'll correct this with the next example.
vp -> verb_group np
verb_group -> aux verb
np -> pronoun
We can see that the non-terminals are: s, vp and np. But the symbols such as adj, aux, pronoun, noun and verb are not what we intend as words in the English language. We want to have some further rewrite rules that define the lexicon. These are:
aux -> are
noun -> planes
pronoun -> they
verb -> are
We draw trees with the root at the top, as was done before. The root is labelled with the distinguished symbol of the grammar. As an exercise, draw as many phrase-structure trees as you can that are dominated by S, then check your solution with mine.
A more considered look at Context-Free Grammar
In our efforts to overcome the limitations of Finite State Automata, we have developed a new formalism generally called the Context-Free Grammar formalism (CFG). CFGs have the following characteristics:
So in the grammar we developed earlier:
S -> cDe
The non-terminals are S and D and the terminals are: c, d and e.
D -> cDe
s -> np vp
vp -> verb np
np -> noun
np -> adj noun
adj -> flying
The "words" will be the leaves of the tree, but we use the term pre-terminal to describe any category that immediately dominates a terminal. So, the non-terminals are: adj, aux, noun, pronoun and verb.
verb -> flying
Drawing Phrase-Structure Trees
Let us suppose that we have an input such as they are flying planes. We can use the grammar to recognise whether the sentence is a member of the language described by the grammar or not. Indeed, if the sentence is a member of the language, then we shall be able to draw the phrase-structure tree or trees.