TEACH CT_SYNTAX                          Chris Hutchison 15th October 1986

*****************************************************************************
File:           $usepop/pop/local/teach/ct_syntax
Purpose:        Introduction to syntactic theory and parsing
Author:         Chris Hutchison 15th October 1986
Machines:
Documentation:  referenced in text
Related Files:  TEACH *CT_ELIZA
*****************************************************************************

FORMAL GRAMMARS IN NATURAL LANGUAGE UNDERSTANDING

1. Why grammars?

Human beings are able to produce and understand a quasi-infinite number
of novel sentences, the 'quasi' indicating a non-linguistic limitation
only of physical tiredness and length of life. E.g. you've probably
never before heard or read the sentences in (1):

(1)  a. All stuffed grey elephants are moderately inflammable.
     b. There are no such things as triangular virtues.

and you could no doubt invent any number of new sentences yourself, and
feel reasonably confident that nobody has ever produced them before. Put
in another way, this means that in any human language there are
infinitely numerous different sentences.

The task the linguist has set himself is to describe human languages -
i.e. infinitely large sets of sentences - and to do so in a manner that
enables him to distinguish between those strings of words that are
sentences in the language described and those that are not; that is, to
distinguish between strings such as those in (1) and non-sentences of
the kind exemplified in (2):

(2)  a. Inflammable all grey moderately elephants stuffed are.
     b. There are are are are as as virtues.

There are essentially two ways the linguist can go about his job:

Method I:  The linguist can attempt to list all the sentences in a
    particular language. Any possible sentence is then bound to be
    included in the enumeration and can be checked against it.
    Non-sentences, such as those in (2), will not be included, and
    therefore will not be recognized as sentences in the language.
This method has some severe disadvantages, however:

(i)  Since we have already established that the number of sentences in
     a language is infinitely numerous, it follows that no finite list
     could ever be complete. For example, we could create from a list
     of sentences of length N an N+1th sentence made up of all the
     sentences in the list conjoined by the word 'and'.

(ii) Although putting a large number of sentences down on paper may in
     some sense be equivalent to describing them, this method does not
     allow for the explicit expression of what kinds of things those
     sentences have in common that distinguishes them from
     non-sentences; that is, does not account for our intuitions, on
     encountering a string of words we have never seen or heard before,
     as to whether that string belongs to the list of sentences (i.e.,
     to the language) or not.

Method I is clearly unsatisfactory. The two disadvantages listed above
converge in a common general observation. The human brain contains only
a finite, even if awesomely vast, number of neurons and connecting
synapses, and therefore human beings have a strictly limited memory. No
single human brain nor even the totality of all human brains could
therefore, by definition, hold in memory all the sentences of a
language, since a finite space can not contain an infinite number of
entities. Linguists therefore adopt another method of describing
languages:

Method II: The linguist seeks to specify a finite, and generally very
    small, number of criteria which any sentence in any particular
    language has to fulfill and of which native speakers have an
    implicit knowledge they can use when making linguistic judgements.

This makes much more sense: a string of words is a sentence in a
language not according to whether it appears on some hypothetical
infinite list (and in any case, who would decide, and by what criteria,
which strings should be in the list and which not?)
but according to whether it meets certain criteria for sentence-hood,
such criteria being implicitly known to speakers of the language and
constituting, when formally expressed, the GRAMMAR of the language. (We
shall think of a grammar in this sense as including both the words of
the language and the rules for their combination).

We may provisionally say, then, that a grammar contains a finite list
of words - the dictionary - together with a limited number of rules
which specify the possible combinations of words; in much the same way,
the small number of pieces on a chess board (its 'dictionary') together
with the rules of chess (its 'grammar') account for the quasi-infinite
number of possible chess games. (We shall need to modify this informal
definition later on).

2. The form of grammars.

How do we go about discovering those rules? What form should the rules
take? Remember how we found sentence patterns in ELIZA, for example:

    [my ??words drinks ??more_words]
and
    [you == me]

which match, respectively, the sentences

    'my aunt Mabel drinks pina colada'
and
    'you never listen to me'

ELIZA doesn't much care what words fill in the spaces in the patterns
marked by the variables "??something" and by the symbols "==", and
therefore the GIGO ('garbage in, garbage out') principle permits the
generation of absolute nonsense. For example, if ELIZA's response to
the first pattern is

    [tell your ??words to stop drinking ??more_words]

and if the current input is

    [my last three drinks were all disgusting]

then the nonsensical output will be

    [tell your last three to stop drinking were all disgusting]

This is because ELIZA, though it makes rough predictions about the
kinds of expressions that can appear in the spaces, has no mechanism
for processing those expressions nor any linguistic knowledge that
would allow it to recognise what kinds of expressions those are.

The point is that sentences tend to be patterned in a small number of
fairly regular ways.
For example, I presume we all intuitively feel that the sentences in
(3) share certain structural features, which are different from those
shared by the sentences in (4). What is it that distinguishes the two
groups of sentences?

(3)  a. My pet wallaby bit the postman
     b. My mother burned your letter
     c. My girlfriend caught a bus

(4)  a. My pet wallaby disappeared
     b. My mother died
     c. My girlfriend fainted

We need to say more, for example, than that the pattern of the first
group of sentences is:

    [my ??any_number_of_words]

or more simply, if we're not going to want to use those words in the
response:

    [my ==]

since, in any case, this does not allow us to distinguish between the
sentences in (3) and those in (4).

One thing we might say is that the sentences in (3) have a final direct
object while those in (4) do not. The general form of the sentences in
(3) is

    subject verb object

and of (4) is

    subject verb

Since the subject in both groups of sentences can be the same, the
difference between the two groups must lie in the rest of the sentence.
The second item -- that is, the verb -- in the sentences in (3) has
different properties from that in the second group of sentences. Note
the strangeness of, for example:

(3') a. *My pet wallaby disappeared the postman
     b. *My mother died your letter
     c. *My girlfriend fainted a bus

(4') a. *My pet wallaby bit
     b. *My mother burned
     c. *My girlfriend caught

(-- the asterisk indicates that the sentence is ungrammatical. (4'b),
of course, means something but not what it meant in (3b)).

We can now say, then, that there is a part of speech -- a lexical
category -- called a verb, and that some verbs require an object while
others don't; those of the first kind we call 'transitive' verbs, those
of the second we call 'intransitive' verbs.

Can we be more specific about the form of the subject and object? Let
us go back to one of our earlier patterns and look at some more
examples of possible and impossible sentences in the English language.
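Before returning to that pattern, its behaviour can be made concrete
with a minimal sketch. It is written in Python rather than POP-11, and
the function names match_my_drinks and respond are hypothetical; it is
only meant to mimic the single pattern [my ??words drinks ??more_words]
and its response template, not to reproduce the real ELIZA program or
the POP-11 matcher.

```python
def match_my_drinks(words):
    """Match the word list against [my ??words drinks ??more_words].

    Returns the two bound segments, or None if the pattern does not fit.
    Like ELIZA, it never asks what KIND of expression fills each slot.
    """
    if not words or words[0] != "my":
        return None
    rest = words[1:]
    if "drinks" not in rest:
        return None
    i = rest.index("drinks")  # first occurrence of the literal word
    return rest[:i], rest[i + 1:]


def respond(sentence):
    """Fill [tell your ??words to stop drinking ??more_words]."""
    bindings = match_my_drinks(sentence.split())
    if bindings is None:
        return None
    words, more_words = bindings
    reply = ["tell", "your", *words, "to", "stop", "drinking", *more_words]
    return " ".join(reply)


print(respond("my aunt Mabel drinks pina colada"))
# tell your aunt Mabel to stop drinking pina colada
print(respond("my last three drinks were all disgusting"))
# tell your last three to stop drinking were all disgusting
```

The second reply is nonsense for exactly the reason given in the text:
the matcher treats 'drinks' as a literal word and cannot tell a verb
from a plural noun.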
We began with the pattern:

    [my ??words drinks ??more_words]

and decided that input sentence (5) elicited a grammatical response
(5') but that sentence (6) didn't:

(5)  My aunt Mabel drinks pina colada
(5') Tell your aunt Mabel to stop drinking pina colada

(6)  My last three drinks were all disgusting
(6') Tell your last three to stop drinking were all disgusting

The subject and the object of the sentence have to be NOUN PHRASES or
other expressions which have the same DISTRIBUTION as noun phrases. The
expressions "my aunt Mabel", "the postman", "pina colada", "a bus", and
"your letter" are all noun phrases, and as such they all have the same
distribution. For example, the sentences in (7) are all perfectly
grammatical sentences -- let us say that they are SYNTACTICALLY
WELL-FORMED -- even if one or two of them have slightly odd meanings:

(7)  a. The postman bit my mother
     b. My girlfriend burned my pet wallaby
     c. Your letter caught a bus
     d. A bus bit my girlfriend

So we can say that a possible sentence pattern in English is:

    [noun_phrase verb noun_phrase]

and you can probably see already that this is a much more formal and
specific description of a sentence than our original ELIZA-like
sentence pattern. (Ignore, for a moment, the details of how a computer
program would go about recognizing a noun-phrase or a verb).

The sentences in (8), like those in (7), are also SYNTACTICALLY
WELL-FORMED, but differ from those in (7) in having intransitive rather
than transitive verbs as main verbs:

(8)  a. The postman disappeared
     b. A bus died
     c. Your letter fainted
     d. My pet wallaby escaped

The underlying pattern here is:

    [noun_phrase verb]

So we can see that an initial noun-phrase can be followed by either a
transitive verb and another noun-phrase or by an intransitive verb.
Schematically, we might represent that rule as follows:

                      / transitive_verb noun_phrase
    [noun_phrase                                      ]
                      \ intransitive_verb

In other words, a "transitive_verb noun_phrase" sequence has the same
DISTRIBUTION as an "intransitive_verb"; and we can easily demonstrate
this by listing two synonymous sentences which differ only in the
transitivity of the verb:

(9)  a. My pet wallaby kicked the bucket
     b. My pet wallaby croaked

each having the meaning "My pet wallaby died".

So let us now say that the sentence is in each case made up of a
subject and a predicate, and that the predicate can have one of two
forms, depending on the transitivity of the verb. Actually, predicates
can be made up of other things as well, so we shall be more specific
and say that, in the above cases, the sentence is made up of an initial
NOUN-PHRASE and a following VERB-PHRASE. We can write the
PHRASE-STRUCTURE rules (sometimes called 'rewrite rules') we have
discovered so far in the following conventional manner:

    S  --> NP VP
    VP --> Vtrans NP
    VP --> Vintrans

where the symbols "S" = sentence, "NP" = noun-phrase, "VP" =
verb-phrase, "Vtrans" = transitive verb, "Vintrans" = intransitive
verb, and "-->" means "can be replaced by".

We can draw a diagram, called a PHRASE-MARKER or PARSE-TREE, to show
how these rules capture the syntactic structure of the sentences in
(9):

(9') a.                 S
                      /   \
                    /       \
                  /           \
               NP               VP
               |                |
               |               /  \
               |             /      \
               |         Vtrans       NP
               |           |          |
               |           |          |
        My pet wallaby   kicked   the bucket

     b.                S
                     /   \
                   /       \
                 /           \
               NP             VP
               |              |
               |              |
               |           Vintrans
               |              |
               |              |
        My pet wallaby     croaked

Compare the form of the PHRASE-MARKERS with the form of the
PHRASE-STRUCTURE RULES. You will see that reading the phrase-markers
from top to bottom is much like reading the phrase-structure rules from
left to right.

Just as we have been able to analyse a VP into its constituents, so too
we can analyse the structure of a NP. In the following sentences the
initial NPs all have the same distribution -- that is, they can all
occur in the same syntactic 'slot' in the sentence:

(10) a. The grey squirrel drank a pink gin
     b.
        Harvey drank a pink gin
     c. He drank a pink gin

And yet the three NPs have different forms: 'he' is a pronoun, 'Harvey'
is a name or proper noun, and 'the grey squirrel' is made up of the
definite article 'the', an adjective 'grey', and a noun 'squirrel'.
Expressing these facts as phrase-structure rules, we get:

    NP --> Pronoun
    NP --> Proper_noun
    NP --> Art Adj Noun

and we could then add these rules to the phrase-structure rules we have
already discovered, thus:

    S  --> NP VP
    VP --> Vtrans NP
    VP --> Vintrans
    NP --> Pronoun
    NP --> Proper_noun
    NP --> Art Adj Noun

The phrase-marker for sentence (10b), for example, might then be
represented in the following manner:

(10') b.             S
                    / \
                  /     \
                /         \
              /             \
            NP                VP
            |                 |
            |                /  \
            |              /      \
       Proper_noun    Vtrans        NP
            |           |            |
            |           |            |
            |           |          /  \
            |           |        /   |   \
            |           |      Art   |    Noun
            |           |       |   Adj     |
            |           |       |    |      |
            |           |       |    |      |
          Harvey      drank     a   pink   gin

3. The form of grammars: revised definition.

I informally defined a grammar earlier as a finite list of words -- the
vocabulary or dictionary of the language -- and a finite set of rules
for specifying all possible grammatical orderings of those words. But
we can only order those words if we know what KINDS of words they are.
I suppose it is just possible that we could say that "after the word
'the' we can have the word 'man' or the word 'alligator' or the word
'artichoke' or ..."; but this is hardly an economical way of going
about things. A better way is to say that "after an article we can have
zero or more adjectives followed by a noun". In other words, in
addition to the rules on the one hand and the dictionary entries on the
other, we have an intermediate set of items -- 'verb', 'noun',
'article', 'adjective', and so on -- which connect the words to the
rules.
These intermediate items name word classes or LEXICAL CATEGORIES; thus
we can augment our grammar in the following manner:

    S  --> NP VP
    NP --> Pronoun
    NP --> Proper_noun
    NP --> Art Noun
    NP --> Art Adj Noun
    VP --> Vtrans NP
    VP --> Vintrans

    Noun        --> man
    Noun        --> squirrel
    Noun        --> gin
    Noun        --> lemonade
    Pronoun     --> he
    Proper_noun --> Ralph
    Adj         --> old
    Adj         --> pink
    Adj         --> beautiful
    Art         --> the
    Art         --> a
    Vtrans      --> drank
    Vtrans      --> caught
    Vintrans    --> died
    Vintrans    --> evaporated

These rules then permit us to produce or GENERATE a fairly large number
of different sentences, e.g.

(11) a. A beautiful squirrel drank the lemonade
     b. The pink gin evaporated
     c. The old man caught a pink squirrel

and so on. Try to work out for yourself which rules, starting from the
"S" symbol, would have been used to generate the above sentences. Draw
a phrase-marker for each sentence, labelling each sub-constituent in
the manner of (9') and (10') above.

4. Generating versus parsing.

Although the 'rewrite arrow' ( --> ) only points one way, this does not
mean to say that the rules only work in one direction. The
right-pointing arrow is no more than an arbitrary historical
convention, and could just as easily be replaced by, for example,
"<--", or "==", and still be held to mean the same thing: "can be
replaced by". Thus:

    an Art + a Noun can be replaced by a NP

just as well as:

    a NP can be replaced by an Art + a Noun

The way in which one reads the arrow may sometimes depend (though NB by
no means always) on whether one is GENERATING a novel sentence or
PARSING a given sentence.

Suppose, for example, you wish to generate a sentence; you might go
through the following reasoning:

    "if I want to make a sentence, then I must first make a noun phrase
    and then a verb phrase. But to make a noun phrase I must find an
    article, perhaps an adjective, and then a noun; or, alternatively,
    I could use a pronoun or a proper noun.
    Then, to make up a verb phrase, I must find either an intransitive
    verb; or, alternatively, a transitive verb followed by another noun
    phrase."

Compare this English description with the phrase-structure rules above.

Now suppose you are given a string of words, and you want to know
whether or not it is an English sentence. You might reason in something
like the following manner:

    "Let's look at each word of the string. Is it listed in the grammar
    as an instance of a lexical category? (That is, does it appear on
    the right hand side of any grammar rule?) If so, then replace it
    with the lexical category. Do any of the symbols I now have combine
    together to form larger constituents? If so, then replace them with
    the label for the higher constituent. If I can keep on doing this
    until I get back to the symbol "S", then I have found a well-formed
    sentence; if not, then the string is not a sentence of English."

Actually, there is no reason why a parser should not start from the "S"
and work down towards the words of the input string, nor any reason why
a sentence generator has to do so; and in fact, there are many kinds of
parser which do start from the hypothesis that the string is a "S" and
go on to try to prove that therefore the string is composed of a NP +
VP, and so on. A parser which starts from the words themselves and
works towards the building of a "S" we call a bottom-up or data-driven
parser; one which starts from the "S" and seeks to prove, by going down
through the constituents and sub-constituents of "S", that the words in
the input string can be combined in specific ways to build up a
well-formed sentence is called a top-down or hypothesis-driven parser.

But for the time being you need not concern yourself with the
technicalities of parsing and generation. In the next lecture we shall
consider other types of linguistic knowledge that are likely to be
necessary for the correct interpretation, by machine or man, of English
sentences and texts.
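For readers who do want to see the two directions side by side, the
following is a minimal sketch of the section 3 grammar used three ways:
to generate a sentence, to recognise one top-down, and to recognise one
bottom-up. It is written in Python rather than POP-11, every function
name is hypothetical, the category names are abbreviated to N, Pron,
PropN and Vintrans for brevity, and words are stored in lower case. It
is one simple way of following the reasoning quoted above, not a
description of any particular Poplog parser.

```python
import random

# Phrase-structure rules as (left-hand side, right-hand side) pairs.
RULES = [
    ("S", ("NP", "VP")),
    ("NP", ("Pron",)),
    ("NP", ("PropN",)),
    ("NP", ("Art", "N")),
    ("NP", ("Art", "Adj", "N")),
    ("VP", ("Vtrans", "NP")),
    ("VP", ("Vintrans",)),
]
# The dictionary: each word with its lexical category.
LEXICON = {
    "man": "N", "squirrel": "N", "gin": "N", "lemonade": "N",
    "he": "Pron", "ralph": "PropN",
    "old": "Adj", "pink": "Adj", "beautiful": "Adj",
    "the": "Art", "a": "Art",
    "drank": "Vtrans", "caught": "Vtrans",
    "died": "Vintrans", "evaporated": "Vintrans",
}
PHRASAL = {lhs for lhs, _ in RULES}


def generate(symbol="S"):
    """Generating: rewrite `symbol` until only words remain."""
    if symbol not in PHRASAL:  # lexical category: pick a word
        return [random.choice([w for w, c in LEXICON.items() if c == symbol])]
    rhs = random.choice([r for lhs, r in RULES if lhs == symbol])
    return [word for part in rhs for word in generate(part)]


def ends(symbol, words, start):
    """Top-down: positions where `symbol` can end if it starts at `start`."""
    if symbol not in PHRASAL:
        ok = start < len(words) and LEXICON.get(words[start]) == symbol
        return {start + 1} if ok else set()
    found = set()
    for lhs, rhs in RULES:
        if lhs == symbol:
            positions = {start}
            for part in rhs:  # each part must follow on from the last
                positions = {e for p in positions for e in ends(part, words, p)}
            found |= positions
    return found


def parse_top_down(sentence):
    """Hypothesis-driven: assume an S and try to prove it covers the words."""
    words = sentence.lower().split()
    return len(words) in ends("S", words, 0)


def reduces_to_s(symbols, seen):
    """Try every way of replacing a right-hand side by its left-hand side."""
    if symbols == ("S",):
        return True
    if symbols in seen:
        return False
    seen.add(symbols)
    for lhs, rhs in RULES:
        n = len(rhs)
        for i in range(len(symbols) - n + 1):
            if symbols[i:i + n] == rhs:
                if reduces_to_s(symbols[:i] + (lhs,) + symbols[i + n:], seen):
                    return True
    return False


def parse_bottom_up(sentence):
    """Data-driven: replace words by categories, then reduce towards S."""
    words = sentence.lower().split()
    if any(w not in LEXICON for w in words):
        return False
    return reduces_to_s(tuple(LEXICON[w] for w in words), set())


print(parse_top_down("The old man caught a pink squirrel"))  # True
print(parse_bottom_up("The pink gin caught"))                # False
```

Both recognisers only answer yes or no; building the phrase-marker
itself would mean recording which rule was used at each step. The
top-down version relies on the grammar having no rule whose right-hand
side begins with its own left-hand side, which holds for the rules
listed here.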
In the meantime, perhaps you can think of some kinds of knowledge,
other than phrase-structure rules of the kind listed above, that a
human being or a machine would have to have in order to understand
ordinary English.

--- File: local/teach/ct_syntax
--- Distribution: all
--- University of Sussex Poplog LOCAL File ------------------------------