Morphological analysis

This section has three parts. The first introduces some basic terms in morphology, in particular morpheme, affix, prefix, suffix, and bound and free forms. The second reviews conventional ways of grouping languages, such as isolating, agglutinative and inflecting. The final part looks at some morphological processes, concentrating only on those of greater relevance to natural language engineering.

Some terminology

Linguistics sets out to describe language, and any description needs some terminology in which to express it. We can think of this as the technical vocabulary of the discipline. Natural languages have their own terms to describe themselves. For instance, we colloquially talk about "words", "phrases", "sentences" and "paragraphs". Do we know what these words mean?

We'll look at just the definition of the word. In text like this, we can easily spot "words" because they are separated from each other by spaces or by punctuation. However, if you record ordinary, conversational speech, you will find that there are no breaks between words. In spite of this, we can isolate units which we use in speech again and again, but in different combinations. This suggests that there is a small unit something like a word. But just how do we define a "word"? We will all agree that black and bird are words. Is blackbird one word or two words? Is blackbirds the same word as blackbird or a separate word?

There are no easy answers to these questions. The situation is more complicated than it first appears because

  1. a linguistic theory should also account for how sounds (phonology) are bound to "words"
  2. English is a relatively straightforward language in this respect: other languages have far more complicated ways of changing the forms of words than English has.

Morphology is the study of the structure and formation of words. Its most important unit is the morpheme, which is defined as the "minimal unit of meaning". (Linguistics textbooks usually define it slightly differently as "the minimal unit of grammatical analysis".) Consider a word like "unhappiness". This has three parts: un, happy and ness.

There are three morphemes, each carrying a certain amount of meaning. un means "not", while ness means "being in a state or condition". Happy is a free morpheme because it can appear on its own (as a "word" in its own right). Bound morphemes have to be attached to a free morpheme, and so cannot be words in their own right. Thus you can't have sentences in English such as "Jason feels very un ness today".
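The segmentation of unhappiness can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny lexicon, the function name `segment` and the single spelling rule (i back to y before a suffix) are all invented for the example and are not a general analyser.

```python
# Hypothetical mini-lexicon; the entries are illustrative assumptions.
PREFIXES = {"un": "not"}                              # bound morphemes
SUFFIXES = {"ness": "being in a state or condition"}  # bound morphemes
FREE = {"happy"}                                      # free morphemes


def segment(word):
    """Split word into (prefix, root, suffix) using the toy lexicon, or return None."""
    for prefix in PREFIXES:
        for suffix in SUFFIXES:
            if word.startswith(prefix) and word.endswith(suffix):
                stem = word[len(prefix):len(word) - len(suffix)]
                # Undo the spelling change y -> i before the suffix (happi -> happy).
                candidates = [stem]
                if stem.endswith("i"):
                    candidates.append(stem[:-1] + "y")
                for root in candidates:
                    if root in FREE:
                        return prefix, root, suffix
    return None


print(segment("unhappiness"))  # ('un', 'happy', 'ness')
```

Only the root is looked up among the free morphemes; un and ness are accepted solely as attachments to it, which mirrors the bound/free distinction described above.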

Armed with these definitions, we can look at ways used to classify languages according to their morphological structure.

The classification of morphological structural types

It was suggested above that English is, from the morphological point of view, quite a straightforward language. The implication is that other languages behave in rather different ways, and this is the basis of this classification scheme. Linguists of earlier generations were quite interested in producing family trees of languages to show which modern languages descended from which earlier ones and, perhaps, even to reconstruct lost languages. Morphological structure is just one way of grouping languages.

There are usually three classes in this classification.

Isolating languages
The words in an isolating language are invariable. To put it another way, such a language is composed of free morphemes, and so there are no morphemes to indicate information like grammatical number (eg plural) or tense (past, present, future). Mandarin Chinese is often quoted as an example of such a language (although some claim Vietnamese to be a better example). The transliterated sentence:
gou bú ài chi qingcài
may be literally translated as:
dog not like eat vegetable
Depending on the context, it can mean any of the four following sentences:
the dog did not like to eat vegetables
the dogs do not like to eat vegetables
the dogs did not like to eat vegetables
dogs do not like to eat vegetables

Agglutinative languages
My dictionary gives the definition of agglutinate as "unite as with glue; (of language) combine simple words without change of form to express compound ideas". Textbook examples are usually based on Turkish or Swahili, of which we'll use the former. In our example we'll use the following morphemes:

To complete our example, we need a Turkish noun, in this case ev which means "house". From this noun we can make the following words:

(Notice that the possessive morpheme i is regularly followed by n before den.)

The important thing about this example is to notice how the morphemes all represent a "unit of meaning" and how they remain absolutely identifiable within the structure of the words. This is in contrast to what happens in the last class: the inflecting languages.
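As a rough sketch of how cleanly agglutinative word forms decompose, the snippet below glues morphemes onto ev. The morpheme inventory is an assumption made for this sketch (ler for plural, i for the possessive "his/her/its", den for "from"), as is the function name `build`. The only adjustment it makes is the linking n before den mentioned above; Turkish vowel harmony is ignored.

```python
def build(root, plural=False, possessive=False, ablative=False):
    """Return (segmented, surface) forms for a Turkish-style noun.

    Assumed morphemes: 'ler' = plural, 'i' = possessive, 'den' = 'from'.
    Vowel harmony is not modelled, so this only suits roots like 'ev'.
    """
    morphemes = [root]
    if plural:
        morphemes.append("ler")
    if possessive:
        morphemes.append("i")
    if ablative:
        if possessive:
            morphemes.append("n")  # linking n between possessive i and den
        morphemes.append("den")
    return "-".join(morphemes), "".join(morphemes)


print(build("ev", plural=True))                                  # ('ev-ler', 'evler')
print(build("ev", possessive=True, ablative=True))               # ('ev-i-n-den', 'evinden')
print(build("ev", plural=True, possessive=True, ablative=True))  # ('ev-ler-i-n-den', 'evlerinden')
```

Each flag contributes exactly one identifiable piece to the word, which is the property that makes agglutinative forms comparatively easy to analyse by machine.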

Inflecting languages
The words in inflecting languages do show different forms and it is possible to break the words into smaller units and label them, in the same way that the Turkish example was presented above. However, the result is a very muddled and contradictory account. Usual examples are based on Latin and rely on a knowledge of the Latin grammatical case system, which most English undergraduates don't have. As a simple example, the Latin for "I love" is amo. This means that the single ending o expresses several meanings at once: first person ("I" or "we"), singular, present tense, and others besides.

This classification has only three classes. Is it really possible to fit all the world's languages into three classes? From one way of looking at the problem, it is impossible to fit any of the languages into any of the classes, because each language is impure. That is to say, if you look hard enough, you will find inflection in mainly agglutinative languages, inflection in isolating languages, agglutination in inflectional languages and so on.

What lessons can we draw from this? I think there are two points worth making. The first is that languages vary greatly, and generalizations based on the experience of only one language (such as English) are likely to be easily refuted by counter-examples from other languages. The second is that language is a naturally occurring phenomenon and the "tools" we use to study it (such as classifications and technical terms) are only tools, which may be imperfect attempts to describe something too complex for our current science.

Morphological processes

In the example given above of unhappiness, we saw two kinds of affix, a prefix and a suffix. Just to show that languages do really vary greatly, there are also infixes. For instance, the Bontoc language from the Philippines uses an infix um to change adjectives and nouns into verbs. So the word fikas, which means "strong", is transformed into the verb "be strong" by the addition of the infix: f-um-ikas.
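Infixation is easy to state procedurally. The sketch below assumes, on the strength of the single fikas example, that the infix is placed immediately after the first letter of the root; the function name `add_infix` is made up for this illustration and the rule is not claimed to cover Bontoc in general.

```python
def add_infix(root, infix="um"):
    """Insert infix after the first letter of root.

    Assumption: the root begins with a single consonant, as fikas does.
    """
    if len(root) < 2:
        raise ValueError("root too short to take an infix")
    return root[0] + infix + root[1:]


print(add_infix("fikas"))  # fumikas
```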

There are a number of morphological processes of which some are more important than others for NLP. The account given here is selective and unusual in that it points out the practical aspects of the processes selected.

Inflection is the process of changing the form of a word so that it expresses information such as number, person, case, gender, tense, mood and aspect, but the syntactic category of the word remains unchanged. As an example, the plural form of the noun in English is usually formed from the singular form by adding an s. The result is still a noun: the syntactic category of the word has not changed.

It doesn't take long to find examples where the simple rule given above doesn't fit. So there are smaller groups of nouns that form plurals in different ways:

A little more thought and we can think of apparently completely irregular plural forms, such as:

English verbs are relatively simple (especially compared with languages like Finnish, which has over 12,000 verb inflections).

NLP aspects
Languages like French or German have much more inflection than English and so it is customary to include morphological analysers in systems that process these languages. NLP systems for English often don't include any morphological processing, especially if they are small-scale systems. Where English-based systems do include analysis of inflection, the regular forms of words are analysed using one of the standard techniques (for instance, Finite State Automata), while the exceptions (the irregular words) are each listed individually. This means that a regularly inflected word has to be entered in the dictionary only once, in its base form, which can save a lot of space and data entry if the dictionary holds a lot of syntactic and semantic information.
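The division of labour between rules and a list of exceptions can be sketched as follows. For brevity the sketch uses ordered suffix-stripping rules rather than a finite state automaton, and the lexicon, the irregular list and the name `analyse_noun` are all assumptions made for the illustration.

```python
# Irregular plurals are listed individually (illustrative sample).
IRREGULAR = {"men": "man", "children": "child", "mice": "mouse"}

# Base forms are entered in the dictionary only once (illustrative sample).
LEXICON = {"dog", "fox", "baby", "bus", "man", "child", "mouse"}

# (suffix to strip, replacement), tried in order from most to least specific.
RULES = [("ies", "y"), ("es", ""), ("s", "")]


def analyse_noun(form):
    """Return (base form, number) for a noun, or None if it cannot be analysed."""
    if form in IRREGULAR:                      # exceptions take priority
        return IRREGULAR[form], "plural"
    for suffix, replacement in RULES:          # then the regular rules
        if form.endswith(suffix):
            base = form[:-len(suffix)] + replacement
            if base in LEXICON:
                return base, "plural"
    if form in LEXICON:                        # otherwise an uninflected base form
        return form, "singular"
    return None


for word in ["dogs", "foxes", "babies", "bus", "children", "cat"]:
    print(word, analyse_noun(word))
```

Checking each candidate base against the lexicon is what stops bus from being analysed as the plural of a non-existent bu.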

As was seen above, inflection doesn't change the syntactic category of a word. Derivation does change the category. Linguists classify derivation in English according to whether or not it induces a change of pronunciation. For instance, adding the suffix ity changes the pronunciation of the root of active so the stress is on the second syllable: activity. The addition of the suffix al to approve doesn't change the pronunciation of the root: approval.

NLP aspects
The obvious use of derivational morphology in NLP systems is to reduce the number of forms of words to be stored. So, if there is already an entry for the base form of the verb sing, then it should be possible to add rules to map the nouns singer and singers onto the same entry. The problem here is that detecting the derivation of singer from sing must also allow the morphological analyser to contribute the information that is special to singer. This seems a little obscure, but an example will make it clearer. The addition of er to the word indicates that it is a person who is undertaking the action. This semantic information must be added to the stored information from the dictionary entry for the root form sing so that the correct meaning of the sentence can be found. This seems fine, but suppose the next two words are: recorder and dragster. The use of the er morpheme can't be taken to necessarily mean someone who undertakes the action represented by the root form.
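One way to picture this is an analyser that adds an "agent" reading when it strips er from a known verb, but consults a list of individually stored words first. The sketch below is a minimal illustration; the verb list, the listed entries, their sense labels and the name `analyse_er` are all hypothetical.

```python
VERBS = {"sing", "record"}  # illustrative dictionary of base verbs

# Words whose meaning is not "person who does X" are listed individually
# and override the general rule (hypothetical entries and sense labels).
LISTED = {
    "recorder": {"root": "record", "category": "noun", "sense": "instrument"},
    "dragster": {"root": None, "category": "noun", "sense": "vehicle"},
}


def analyse_er(word):
    """Analyse a noun ending in er (or ers); return a dict of features or None."""
    number = "singular"
    if word.endswith("s") and word[:-1].endswith("er"):
        word, number = word[:-1], "plural"     # strip the inflectional s first
    if word in LISTED:
        return {**LISTED[word], "number": number}
    if word.endswith("er") and word[:-2] in VERBS:
        # The derivation contributes information of its own: an agent reading.
        return {"root": word[:-2], "category": "noun",
                "sense": "agent", "number": number}
    return None


print(analyse_er("singers"))
print(analyse_er("recorder"))
print(analyse_er("dragster"))
```

The listed entries are checked before the rule, so recorder is never given the agent reading even though record is a known verb.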

Any linguistic processing is likely to have an effect of uncovering potential ambiguity, particularly where humans wouldn't expect it. Derivational morphological analysers can do this quite easily, because they are always attempting to reduce words to smaller units. So the word really might be analysed both as really (ie the word itself) and as re + ally. This introduced ambiguity may be eliminated by later, syntactic processing but nonetheless, it does mean that there has to be more processing (and so a slower system).
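The effect is easy to reproduce with a naive analyser that returns every segmentation its lexicon permits. The lexicon, the affix lists and the name `analyses` below are assumptions made for the sketch; note that it also finds real + ly, a third analysis beyond the two mentioned above.

```python
LEXICON = ["really", "ally", "real"]  # illustrative entries
PREFIXES = ["re"]
SUFFIXES = ["ly"]


def analyses(word):
    """Return every segmentation of word that the toy lexicon allows."""
    results = []
    if word in LEXICON:
        results.append([word])                 # the word itself
    for prefix in PREFIXES:
        rest = word[len(prefix):]
        if word.startswith(prefix) and rest in LEXICON:
            results.append([prefix, rest])     # prefix + known word
    for suffix in SUFFIXES:
        stem = word[:len(word) - len(suffix)]
        if word.endswith(suffix) and stem in LEXICON:
            results.append([stem, suffix])     # known word + suffix
    return results


print(analyses("really"))  # [['really'], ['re', 'ally'], ['real', 'ly']]
```

Every extra analysis is another candidate that later processing has to consider and reject.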

Derivational morphology is particularly useful for Machine Translation. Successful MT has to process large quantities of text which can contain many previously unseen words. Some words are neologisms (that is, newly-coined words). If the analyser can reduce these words to their base form, it may be able to translate that and, in effect, coin a new word in the target language by simply following rules. To give a couple of examples: neologisms often have a proper name as their root. A knowledge of how Thatcherite and Majorism were formed from proper names could enable an MT system to translate them into an idiomatic equivalent in the target language.

Semi-affixes and combining forms
Semi-affixes are morphemes that are bound but which retain a word-like quality. Examples are: anti-, counter-, -like and -worthy. So we can have:

Combining forms are even more word-like than semi-affixes and frequently occur in technical literature, for instance Indo-European or gastro-enteritis. Some words can be made up entirely from bound forms, but without a free morpheme, eg franco-phile.

NLP aspects
As with derivational morphology, semi-affixes and combining forms can be analysed into their morphemes and, again as with derivational morphology, this analysis can be used to handle previously unseen words. Hyphenation is a particular problem for languages like English and German and an understanding of semi-affixes and combining forms can contribute to identifying optional and likely hyphenation points when processing text.
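A minimal sketch of the hyphenation idea follows, using only the four semi-affixes named above. The function name `hyphenation_points` is hypothetical, and the approach is deliberately naive: it proposes a break wherever a word starts or ends with one of the listed strings, so it will also propose spurious breaks (for example in antique).

```python
SEMI_PREFIXES = ("anti", "counter")
SEMI_SUFFIXES = ("like", "worthy")


def hyphenation_points(word):
    """Return sorted character offsets at which a hyphen could be placed."""
    points = set()
    for prefix in SEMI_PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix):
            points.add(len(prefix))
    for suffix in SEMI_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            points.add(len(word) - len(suffix))
    return sorted(points)


for word in ["counterattack", "seaworthy", "like"]:
    cuts = hyphenation_points(word)
    pieces = [word[i:j] for i, j in zip([0] + cuts, cuts + [len(word)])]
    print(word, "->", "-".join(pieces))
```

A practical system would check the remaining part of the word against a lexicon before accepting a break, in the same way as the analysers for inflection and derivation.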

A clitic is an element that behaves like both an affix and a word. However, clitics are quite complicated in that they are also part of word formation. Unlike other morphological phenomena, clitics occur in a syntactic structure and their attachment to words isn't part of the word formation rules like the rest of morphology.

We'll expand on this in parts.

"A clitic is an element that behaves like both an affix and a word."
English has an obvious clitic, the "'s" used to denote the possessive (sometimes known as the genitive grammatical case). Linguists call it an enclitic, which means that it is a clitic that attaches to the right of the word, as does a suffix.

"However, clitics are quite complicated in that they are also part of word formation."
The "'s" is attached ("glued on") to the word or phrase to which the possessive is related.

"Unlike other morphological phenomena, clitics occur in a syntactic structure and their attachment to words isn't part of the word formation rules like the rest of morphology."
The "'s" is attached to a specific constituent irrespective of where it occurs in the sentence. In the following two examples, "'s" is attached first to a noun and then to a preposition:
  1. the girl's penguin
  2. the car I bumped into's headlight
The linguists' claim is that the "'s" is produced from the lexicon to express the possessive relation, but that its position is determined by the syntactic structure of the utterance.

English also shows a different source of cliticization. Some words can be reduced to a shorter form. For instance, I am in Birmingham can be reduced to I'm in Birmingham and It is a spring chicken can be reduced to It's a spring chicken. Note that the word being reduced has its own syntactic category and would feature in its own right in any syntactic analysis of a sentence.

NLP aspects
Cliticization is an interesting problem for NLP. Conventional NLP systems are modular and so have distinct morphological, syntactic and semantic processing modules. However, clitics like "'s" can't be satisfactorily analysed at just one level. A morphological analyser has to be able to separate the clitic from its attached morpheme, but it cannot do this correctly unless it knows about the syntactic structure of the utterance. In a conventional NLP system architecture, syntactic structure is not available to a morphological analyser.

There are various methods that could be used to alleviate this problem, such as passing as many alternatives as possible from the morphological analyser to the syntactic parser and hoping that the latter can resolve the ambiguities. Another method might be to attempt to operate morphological and syntactic analysis in parallel - or perhaps morphology and syntax are imperfect ways of describing language and we should find a better descriptive model.
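The first of these methods, passing every alternative on to the parser, might look something like the sketch below. It covers only the two clitics discussed above ('s and 'm); the function name `clitic_alternatives` and the category labels are assumptions made for the illustration.

```python
def clitic_alternatives(token):
    """Return a list of alternative analyses; each is a list of (form, label) pairs.

    The morphological analyser cannot choose between the alternatives,
    so it hands all of them to the syntactic parser.
    """
    if token.endswith("'m"):
        return [[(token[:-2], "host"), ("am", "verb")]]
    if token.endswith("'s"):
        host = token[:-2]
        return [
            [(host, "host"), ("'s", "possessive")],  # the girl's penguin
            [(host, "host"), ("is", "verb")],        # It's a spring chicken
        ]
    return [[(token, "word")]]


for token in ["girl's", "into's", "I'm", "penguin"]:
    print(token, clitic_alternatives(token))
```

Note that into's receives the same pair of alternatives as girl's: nothing at the level of the word shows that the possessive belongs to the whole phrase the car I bumped into, which is exactly the information that only the syntactic structure can supply.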

Criticisms of morphology

Morphology has been a part of mainstream linguistics for sixty years or more. As is apparently the way with all linguistic theories, the passage of time serves only to uncover more and more shortcomings in the theory and to prompt further elaborations that seek to strengthen the original theory - even to the stage where it's so heavily burdened as to collapse altogether.

Morphology has certainly been extended and is a field in its own right with an extensive technical vocabulary such as morph and allomorph (to mention but the two most common). For a more developed account, see Lyons (1968, pp 180-194) and for an account of some of the difficulties with morphemes, see Palmer (1971, pp 187-199) who wrote:

"What is clear nowadays, however, is that the morpheme concept is only of limited value. It can certainly display the minimal units of grammatical analysis in a vast amount of language data. The irregularities of English are not, after all, very many. And when it comes to the analysis of the agglutinative languages, the morpheme concept is invaluable, as these languages are, as it were, tailor-made for it. But when we consider the difficulties of morphemic identification as a whole... it is clear that the concept is not as all-embracing as it has sometimes been made out to be."