This section has three parts. The first introduces some basic terms in morphology, in particular morpheme, affix, prefix, suffix, and bound and free forms. The second reviews conventional ways of grouping languages, such as isolating, agglutinative and inflecting. The final part looks at some morphological processes, concentrating on those of greater relevance to natural language engineering.
Linguistics sets out to describe language, and any description needs terminology in which to be set out. We can think of this as the technical vocabulary of the discipline. Natural languages have their own terms to describe themselves: colloquially, we talk about "words", "phrases", "sentences" and "paragraphs". But do we know what these words mean?
We'll look at just the definition of the word. In text like this, we can easily spot "words" because they are separated from each other by spaces or by punctuation. However, if you record ordinary, conversational speech, you will find that there are no breaks between words. In spite of this, we can isolate units which we use in speech again and again, but in different combinations. This suggests that there is a small unit something like a word. But just how do we define a "word"? We will all agree that black and bird are words. Is blackbird one word or two? Is blackbirds the same word as blackbird or a separate word?
There are no easy answers to these questions. The situation is more complicated because
Morphology is the study of the structure and formation of words. Its most important unit is the morpheme, which is defined as the "minimal unit of meaning". (Linguistics textbooks usually define it slightly differently, as "the minimal unit of grammatical analysis".) Consider a word like "unhappiness". This has three parts:
There are three morphemes, each carrying a certain amount of meaning. un means "not", while ness means "being in a state or condition". Happy is a free morpheme because it can appear on its own (as a "word" in its own right). Bound morphemes have to be attached to a free morpheme, and so cannot be words in their own right. Thus you can't have sentences in English such as "Jason feels very un ness today".
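The decomposition of unhappiness can be sketched in a few lines of Python. The prefix and suffix lists, and the y-to-i spelling rule, are assumptions made for this illustration only; this is not a general analyser:

```python
# Illustrative morpheme splitter for the "unhappiness" example.
# The affix lists and the y -> i spelling rule are assumptions
# for this sketch, not a description of English as a whole.

PREFIXES = ["un"]
SUFFIXES = ["ness"]
FREE_MORPHEMES = {"happy"}

def segment(word):
    """Strip one known prefix and one known suffix, then check
    whether the remainder is a free morpheme."""
    prefix = suffix = None
    for p in PREFIXES:
        if word.startswith(p):
            prefix, word = p, word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s):
            suffix, word = s, word[:-len(s)]
            break
    # undo the y -> i spelling change (happi -> happy)
    if word.endswith("i"):
        word = word[:-1] + "y"
    if word in FREE_MORPHEMES:
        return [m for m in (prefix, word, suffix) if m]
    return None

print(segment("unhappiness"))  # ['un', 'happy', 'ness']
```

Notice that the analyser returns the free morpheme happy together with the two bound morphemes, which cannot stand alone.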
Armed with these definitions, we can look at ways used to classify languages according to their morphological structure.
It was suggested above that English is, from the morphological point of view, quite a straightforward language. The implication is that other languages behave in rather different ways, and this is the basis of this classification scheme. Linguists of earlier generations were quite interested in producing family trees of languages, showing which modern languages descended from which earlier ones and, perhaps, even reconstructing lost languages. Morphological structure is just one way of grouping languages.
There are usually three classes in this classification.
gou bú ài chi qingcài may be literally translated as:
dog not like eat vegetable
Depending on the context, it can mean any of the following four sentences:
the dog did not like to eat vegetables
the dogs do not like to eat vegetables
the dogs did not like to eat vegetables
dogs do not like to eat vegetables
To complete our example, we need a Turkish noun, in this case ev which means "house". From this noun we can make the following words:
(Notice that the possessive morpheme i is regularly followed by n before den.)
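The agglutinative pattern can be illustrated with a short sketch. The morphemes used here (ler for the plural, im "my", i "his/her", den "from") follow standard descriptions of Turkish, and the buffer n is the one mentioned in the note above; the vowel-harmony variants of these suffixes are ignored for simplicity:

```python
# Sketch of agglutinative word formation for Turkish "ev" ("house").
# Suffix forms are simplified: real Turkish varies them by vowel harmony.

PLURAL = "ler"
POSSESSIVE = {"my": "im", "his": "i"}
ABLATIVE = "den"

def build(noun, plural=False, possessor=None, ablative=False):
    word = noun
    if plural:
        word += PLURAL
    if possessor:
        word += POSSESSIVE[possessor]
    if ablative:
        # the 3rd-person possessive i takes a buffer n before den
        if possessor == "his":
            word += "n"
        word += ABLATIVE
    return word

print(build("ev", plural=True, possessor="my", ablative=True))  # evlerimden, "from my houses"
print(build("ev", possessor="his", ablative=True))              # evinden, "from his house"
```

Each morpheme is simply concatenated and remains identifiable in the result, which is exactly the property the text describes.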
The important thing about this example is to notice how the morphemes all represent a "unit of meaning" and how they remain absolutely identifiable within the structure of the words. This is in contrast to what happens in the last class: the inflecting languages.
This classification has only three classes. Is it really possible to fit all the world's languages into three classes? From one way of looking at the problem, it is impossible to fit any of the languages into any of the classes, because each language is impure. That is to say, if you look hard enough, you will find inflection in mainly agglutinative languages, inflection in isolating languages, agglutination in inflectional languages and so on.
What lessons can we draw from this? I think there are two points worth making. The first is that languages vary greatly and generalizations based on the experience of only one language (such as English) are likely to be easily counter-exampled from other languages. The second is that language is a naturally occurring phenomenon and "tools" we use to study it (such as classifications and technical terms) are only tools, which may be imperfect attempts to describe something too complex for our current science.
In the example given above of unhappiness, we saw two kinds of affix, a prefix and a suffix. Just to show that languages really do vary greatly, there are also infixes. For instance, the Bontoc language of the Philippines uses an infix um to change adjectives and nouns into verbs. So the word fikas, which means "strong", is transformed into the verb "be strong" by the addition of the infix: f-um-ikas.
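A minimal sketch of the infixation, assuming the simplified rule that um is inserted after a word-initial consonant (the placement rule in Bontoc is more involved than this):

```python
# Hedged sketch of Bontoc-style infixation: insert -um- after the
# initial consonant. The placement rule is deliberately simplified.

def infix_um(stem):
    vowels = "aeiou"
    if stem and stem[0] not in vowels:
        return stem[0] + "um" + stem[1:]
    # assumption for this sketch: before a vowel the infix appears initially
    return "um" + stem

print(infix_um("fikas"))  # fumikas, "be strong"
```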
There are a number of morphological processes of which some are more important than others for NLP. The account given here is selective and unusual in that it points out the practical aspects of the processes selected.
It doesn't take long to find examples where the simple rule given above doesn't fit. There are smaller groups of nouns that form their plurals in different ways:
A little more thought and we can think of apparently completely irregular plural forms, such as:
English verbs are relatively simple (especially compared with languages like Finnish which has over 12,000 verb inflections).
Languages like French or German have much more inflection than English and so it is customary to include morphological analysers in systems that process these languages. NLP systems for English often don't include any morphological process, especially if they are small-scale systems. Where English-based systems do include analysis of inflection, the regular forms of words are analysed using one of the standard techniques (for instance, Finite State Automata), while the exceptions (the irregular words) are each listed individually. This means that regular forms have to be entered in the dictionary only once, which can save a lot of space and data entry if the dictionary holds a lot of syntactic and semantic information.
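The technique described here, with irregular forms listed individually and regular forms handled by rule, can be sketched as follows; the word lists are illustrative only, and a real system would use a finite state machine rather than string tests:

```python
# A minimal plural analyser: irregular forms are listed individually,
# regular forms are handled by rule, so each lemma is stored only once.

IRREGULAR = {"mice": "mouse", "children": "child", "oxen": "ox"}
LEXICON = {"dog", "church", "mouse", "child", "ox"}

def analyse_plural(word):
    """Return the singular lemma for a plural noun, or None."""
    if word in IRREGULAR:                              # exception list
        return IRREGULAR[word]
    if word.endswith("es") and word[:-2] in LEXICON:   # churches -> church
        return word[:-2]
    if word.endswith("s") and word[:-1] in LEXICON:    # dogs -> dog
        return word[:-1]
    return None

print(analyse_plural("mice"))      # mouse
print(analyse_plural("churches"))  # church
print(analyse_plural("dogs"))      # dog
```

The space saving comes from the regular rules: only mouse, not mice, needs a full dictionary entry with its syntactic and semantic information.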
The obvious use of derivational morphology in NLP systems is to reduce the number of forms of words to be stored. So, if there is already an entry for the base form of the verb sing, then it should be possible to add rules to map the nouns singer and singers onto the same entry. The problem here is that the detection of the derivation of singer from sing must also allow the morphological analyser to contribute the information that is special to singer. This seems a little obscure, but an example will make it clearer. The addition of er to the word indicates a person who is undertaking the action. This semantic information must be added to the stored information from the dictionary entry for the root form sing so that the correct meaning of the sentence can be found. This seems fine, but suppose the next two words are recorder and dragster. The er morpheme cannot necessarily be taken to mean someone who undertakes the action represented by the root form.
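The recorder/dragster problem can be made concrete with a sketch in which non-agentive er words are simply listed as exceptions, just as irregular inflections were; all the entries here are illustrative:

```python
# Sketch of -er derivation: strip -er and add an AGENT reading,
# but block the rule for listed exceptions such as "recorder".

VERB_LEXICON = {"sing", "paint", "record"}
NON_AGENT_ER = {"recorder", "dragster"}  # entered individually

def analyse_er(word):
    if word in NON_AGENT_ER or not word.endswith("er"):
        return None
    root = word[:-2]
    if root in VERB_LEXICON:
        # semantic contribution of -er: the person doing the action
        return {"root": root, "semantics": "AGENT"}
    return None

print(analyse_er("singer"))    # {'root': 'sing', 'semantics': 'AGENT'}
print(analyse_er("recorder"))  # None, blocked by the exception list
```

Note that record is a perfectly good verb, so without the exception list the analyser would wrongly read recorder as "a person who records".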
Any linguistic processing is likely to have the effect of uncovering potential ambiguity, particularly where humans wouldn't expect it. Derivational morphological analysers can do this quite easily, because they are always attempting to reduce words to smaller units. So the word really might be analysed both as really (ie the word itself) and as re + ally. This introduced ambiguity may be eliminated by later, syntactic processing, but it nonetheless means that there has to be more processing (and so a slower system).
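The really/re + ally ambiguity can be reproduced with a toy segmenter that returns every analysis it finds rather than committing to one; the word and prefix lists are assumptions for this sketch:

```python
# Toy derivational segmenter: returns all analyses, deliberately
# exposing the ambiguity the text describes.

PREFIXES = {"re"}
WORDS = {"really", "ally"}

def analyses(word):
    results = []
    if word in WORDS:                      # the word taken whole
        results.append([word])
    for p in PREFIXES:                     # prefix + remainder
        if word.startswith(p) and word[len(p):] in WORDS:
            results.append([p, word[len(p):]])
    return results

print(analyses("really"))  # [['really'], ['re', 'ally']]
```

Both analyses must be passed on, and it falls to later syntactic processing to discard the unwanted one.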
Derivational morphology is particularly useful for Machine Translation. Successful MT has to process large quantities of text which can contain many previously unseen words. Some words are neologisms (that is, newly-coined words). If the analyser can reduce these words to their base form, it may be able to translate that and, in effect, coin a new word in the target language by simply following rules. To give a couple of examples: neologisms often have a proper name as their root. A knowledge of how Thatcherite and Majorism were formed from proper names could enable an MT system to translate them into an idiomatic equivalent in the target language.
Combining forms are even more word-like than semi-affixes and frequently occur in technical literature, for instance Indo-European or gastro-enteritis. Some words can be made up entirely from bound forms, with no free morpheme at all, eg franco-phile.
As with derivational morphology, semi-affixes and combining forms can be analysed into their morphemes, and this analysis can likewise be used on previously unseen words. Hyphenation is a particular problem for languages like English and German, and an understanding of semi-affixes and combining forms can contribute to identifying optional and likely hyphenation points when processing text.
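As a sketch, a small list of combining forms can be used to propose a hyphenation point in an unseen technical word; the list itself is illustrative only:

```python
# Propose a hyphenation point after a recognised combining form.
# The form list is a tiny illustrative sample.

COMBINING_FORMS = ["gastro", "franco", "indo"]

def hyphenation_point(word):
    """Return the index after a leading combining form, or None."""
    w = word.lower()
    for form in COMBINING_FORMS:
        if w.startswith(form) and len(w) > len(form):
            return len(form)
    return None

i = hyphenation_point("gastroenteritis")
print("gastroenteritis"[:i] + "-" + "gastroenteritis"[i:])  # gastro-enteritis
```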
We'll expand on this in parts.
English also shows a different source of cliticization. Some words can be reduced to a shorter form. For instance, I am in Birmingham can be reduced to I'm in Birmingham and It is a spring chicken can be reduced to It's a spring chicken. Note that the word being reduced has its own syntactic category and would feature in its own right in any syntactic analysis of a sentence.
Cliticization is an interesting problem for NLP. Conventional NLP systems are modular and so have distinct morphological, syntactic and semantic processing modules. However, clitics like "'s" can't be satisfactorily analysed at just one level. A morphological analyser has to be able to separate the clitic from its attached morpheme, but it cannot do this correctly unless it knows about the syntactic structure of the utterance. In a conventional NLP system architecture, syntactic structure is not available to a morphological analyser.
There are various methods that could be used to alleviate this problem, such as passing as many alternatives as possible from the morphological analyser to the syntactic parser and hoping that the latter can resolve the ambiguities. Another method might be to attempt to operate morphological and syntactic analysis in parallel - or perhaps morphology and syntax are imperfect ways of describing language and we should find a better descriptive model.
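The first method mentioned, passing all alternatives from the morphological analyser to the parser, might look like this in outline; the readings and tags are illustrative assumptions rather than a standard tagset:

```python
# Sketch: the morphological analyser emits every reading of "'s"
# and leaves disambiguation to the syntactic parser.

def expand_clitic(token):
    if token.endswith("'s"):
        host = token[:-2]
        return [
            (host, "is"),    # contraction of "is":  "the dog's barking"
            (host, "has"),   # contraction of "has": "the dog's gone"
            (host, "POSS"),  # possessive marker:    "the dog's bone"
        ]
    return [(token, None)]

for reading in expand_clitic("dog's"):
    print(reading)
```

The cost of this design is the one noted above for derivational ambiguity: the parser receives more alternatives and so has more work to do.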
Morphology has certainly been extended and is a field in its own right with an extensive technical vocabulary such as morph and allomorph (to mention but the two most common). For a more developed account, see Lyons (1968; pp 180-194) and for an account of some of the difficulties with morphemes, see Palmer (1971, pp 187-199) who wrote:
"What is clear nowadays, however, is that the morpheme concept is only of limited value. It can certainly display the minimal units of grammatical analysis in a vast amount of language data. The irregularities of English are not, after all, very many. And when it comes to the analysis of the agglutinative languages, the morpheme concept is invaluable, as these languages are, as it were, tailor-made for it. But when we consider the difficulties of morphemic identification as a whole... it is clear that the concept is not as all-embracing as it has sometimes been made out to be."