Introduction to the terminology of syntax and context-free grammars

Syntax has a terminology of its own which includes words like terminals, pre-terminals and non-terminals. This section introduces these terms, and includes some working definitions of more common words like noun, verb, preposition, determiner, etc.

Context-free grammars are a method of describing language (and perhaps many other hierarchical things, such as menus, file structures, etc). They are introduced, showing how they are related to phrase structure trees.

Contents


Overcoming the limitations of Finite State Automata

We saw in another section that Finite State Automata are mathematically inadequate for the description of natural languages that include centre-embedding. We'll start by considering a way of writing such descriptions.

The example required that our recogniser must only accept input in the form: cªdeª, where a > 0. We know that a valid sequence would be:

cde
Now, we'll stretch our notation a little. We'll let lower-case letters represent the "words" of the language (ie c, d, e). We'll put upper-case letters into our description when we want them to refer to something - but we don't quite know what it will be yet!

Confused? Look at this example.

cDe
To start-off, we'll make D stand in place of d. We'll write this rule as follows:
D -> d
This means that the previous sequence: cDe can now stand for cde.

So far we haven't gained anything, but made life more complicated for ourselves. Now we'll get something extra out of this complication. Let's use D to represent alternatives. We'll let it stand for either: d or (and this is the important one) cDe. We can write this as:

D -> d
D -> cDe
Whenever we see the symbol D we can rewrite it in either of these two ways.

Just to make things more elegant, we'll force the very first example (cDe) into the same kind of form:

S -> cDe
Suppose we start with the single symbol S. We can rewrite this by substituting the right-hand side of the arrow for the left-hand side. For instance:
S		{start}
cDe		{use the right-hand side of the S rewrite rule}
ccDee		{replace D in the previous line with the
		 right-hand side of the second D rewrite rule}
cccDeee		{replace D in the previous line with the
		 right-hand side of the second D rewrite rule}
cccdeee		{replace D in the previous line with the
		 right-hand side of the first D rewrite rule}

(You may wonder why I used the second D rewrite rule twice and then abruptly used the first. The answer is simple: there was no algorithm to it; the example had simply become long enough to make the point I wanted.)

This rather bald way of simply setting down rewritings, line-by-line is only one method of writing derivations. A far more attractive and usual method is to draw a tree that shows (at each level) the rewrite that has taken place. We can think of each rule above as being a specification of part of a tree. So the rule:

S -> cDe
is just another way of representing:

The tree for the full rewrite would look like:


A more considered look at Context-Free Grammar

In our efforts to overcome the limitations of Finite State Automata, we have developed a new formalism generally called the Context-Free Grammar formalism (CFG). CFGs have the following characteristics: So in the grammar we developed earlier:
S -> cDe

D -> d
D -> cDe

The non-terminals are S and D and the terminals are: c, d and e.

Thus far the example has been rather abstract, as it has been based on sequences of symbols which bear almost no relation to any natural language. We'll correct this with the next example.

s -> np vp

vp -> verb_group np
vp -> verb np

verb_group -> aux verb

np -> pronoun
np -> noun
np -> adj noun

We can see that the non-terminals are: s, vp and np. But the symbols such as adj, aux, pronoun, noun and verb are not what we intend as words in the English language. We want to have some further rewrite rules that define the lexicon. These are:

adj -> flying

aux -> are

noun -> planes

pronoun -> they

verb -> are
verb -> flying

The "words" will be the leaves of the tree, but we use the term pre-terminal to describe any category that immediately dominates a terminal. So, the non-terminals are: adj, aux, noun, pronoun and verb.


Drawing Phrase-Structure Trees

Let us suppose that we have an input such as they are flying planes. We can use the grammar to recognise whether the sentence is a member of the language described by the grammar or not. Indeed, if the sentence is a member of the language, then we shall be able to draw the phrase-structure tree or trees.

We draw trees with the root at the top, as was done before. The root is labelled with the distinguished symbol of the grammar. As an exercise, draw as many phrase-structure trees as you can that are dominated by S, then check your solution with mine.


Ambiguity

The previous example showed a clear case of global ambiguity. Ambiguity is where there is more than one interpretation possible. In NLP we distinguish between global ambiguity, where there is more than one possible interpretation of a whole utterance (eg a sentence) and local ambiguity. The latter is where there is a part of an utterance that seems ambiguous. Later information will allow us to ignore all or most of the interpretations.


Syntactic categories

To those coming to NLP without much of a background in linguistics or traditional grammar, the terms used in syntax can seem very confusing. We've already encountered the words non-terminal, pre-terminal and terminal. The good news is that terminals are words from the lexicon, so we need concern ourselves with these no further.

Non-terminals In Context-Free Grammars (CFGs), we write rules to describe phrases. Phrases are simply groups of words and the name of the phrase usually takes its name from some important word in the phrase, for instance:

There are other phrases that are sometimes used, such as verb_group used in the previous example.

Pre-terminals In the CFG we have used so far, we used categories such as:

adj = adjective
A word that qualifies a noun - for instance tabby in tabby cat.

adv = adverb
A word that qualifies a verb - for instance boldly in to go boldly.

aux = auxiliary
A common verb occurs with another verb to add some further meaning - for instance will in will fly adds the notion of future action.

det = determiner
A word which specifies - for instance the in the book.

noun
A word that refers to an object or entity - for instance cat.

prep = preposition
A word that denotes position - for instance in in in the park.

pronoun
A word that refers to a person without using its name - for instance they, he, it.

verb
A word that refers to an action - for instance flying, washing, thinking.

It would be nice to be able to say that these definitions are cast in stone and beyond reproach. They are not. There are many theoretical and empirical objections to this classification, not least that syntactic categories are defined in terms of their meaning. These categories are widely used and are used here for pragmatic reasons. The objections to this classification are beyond the scope of this course, but are laid out by Radford (1988).