A Morphological Analyser

The Finite State Transition Network (FSTN) was introduced in another section. This section will expand on the program given there so as to build an elementary morphological analyser.


An outline design

The previous FSTN was designed to recognise whether some input was a member of a language or not. For practical NLP we usually require more information, such as the syntactic category (often known as the part-of-speech) and perhaps other information such as tense, number, grammatical person, etc.

For this system, we want to produce two kinds of information.

  1. Syntactic category. We will represent this as a Prolog structured object, with the word as the argument and the syntactic category as the functor, eg: noun(cat), verb(trust).
  2. Grammatical features. For this system, which is designed for English, we will have: This will be represented as another Prolog structured object, with five arguments (one for each of the above grammatical features, and one argument spare for later expansion).

We have to arrange for this information to be returned every time the Finite State Machine arrives at an end-state.


The essential morphological analyser

	/* ************************************************ */
	/*                                                  */
	/*   Constants                                      */
	/*                                                  */
	/*   Here are stored any general facts that are     */
	/*   referenced in other parts of the program.      */
	/*                                                  */
	/* ************************************************ */

	final_state(9999).

	/* ************************************************ */
	/*                                                  */
	/*   arc/6                                          */
	/*      Arg 1:  Start state                         */
	/*      Arg 2:  Final state                         */
	/*      Arg 3:  [Label|Remainder of Input]          */
	/*      Arg 4:  Remainder of Input (List)           */
	/*      Arg 5:  Syntactic category                  */
	/*      Arg 6:  Grammatical features                */
	/*   Summary: defines a FSTN.                       */
	/*   Author: P J Hancox                             */
	/*   Date:   9 February 1995                        */
	/*                                                  */
	/* ************************************************ */

	arc(1,    2,     [c|S], S, _Word, _Features).
	arc(1,    4,     [g|S], S, _Word, _Features).
	arc(1,    10,    [t|S], S, _Word, _Features).

	arc(2,    3,     [a|S], S, _Word, _Features).
	
	arc(3,    8000,  [t|S], S, noun(cat), _Features).
	
	arc(4,    5,     [i|S], S, _Word, _Features).
	
	arc(5,    6,     [r|S], S, _Word, _Features).
	
	arc(6,    8000,  [l|S], S, noun(girl), _Features).
	
	arc(10,   11,    [h|S], S, _Word, _Features).
	
	arc(11,   9999,  [e|S], S0, det(the), features(_Number,
	                                      person(third), _Tense,
	                                      _Participle,   _)) :-
	     punctuation(S, S0).
	
	arc(10,   13,   [r|S], S, _Word, _Features).
	
	arc(13,   14,   [u|S], S, _Word, _Features).
	
	arc(14,   15,   [s|S], S, _Word, _Features).
	
	arc(15,   8000, [t|S], S, noun(trust), _Features).
	arc(15,   8500, [t|S], S, verb(trust), _Features).

Thus far the network is just an elaboration of what we have previously seen. It actually covers the words the, girl, girls, trust, trusts, trusting, and trusted. The next part of the network accounts for regular endings of words.

You will see that some of the state names are unusually "large" in comparison with others. I've used states from 8000 onwards to represent regular endings of words.

Nouns
This part of the network represents the simple form of the noun that forms its plural by simply adding an "s". If the word is in the singular form, the appropriate features are returned; if it is plural, the number argument is instantiated to "plural".

	% nouns
	% paradigm 1
	arc(8000, 9999, S, S0, _Word, features(number(singular),
	                                       person(_), _Tense,
	                                       _Participle,   _)) :-
	     punctuation(S, S0).
	arc(8000, 8001, [s|S], S, _Word, _Features).
	arc(8001, 9999, S, S0, _Word, features(number(plural),
	                                       person(third), _Tense,
	                                       _Participle,   _)) :-
	     punctuation(S, S0).

Verbs
This part of the network represents the simple form of the verb.

	% verbs
	% paradigm 1
	arc(8500, 9999, S, S0, verb(_Word), features(number(_),
	                                       person(_), tense(present),
	                                       participle(no),   _)) :-
	     punctuation(S, S0).
	arc(8500, 8501, [s|S], S, _Word, _Features).
	arc(8501, 9999, S, S0, verb(_Word), features(number(singular),
	                                       person(third), tense(present),
	                                       participle(no),   _)) :-
	     punctuation(S, S0).
	arc(8500, 8502, [e|S], S, _Word, _Features).
	arc(8502, 8503, [d|S], S, _Word, _Features).
	arc(8503, 9999, S, S0, verb(_Word), features(number(_),
	                                       person(_), tense(past),
	                                       participle(yes),   _)) :-
	     punctuation(S, S0).
	arc(8500, 8504, [i|S], S, _Word, _Features).
	arc(8504, 8505, [n|S], S, _Word, _Features).
	arc(8505, 8506, [g|S], S, _Word, _Features).
	arc(8506, 9999, S, S0, verb(_Word), features(number(_),
	                                       person(_), tense(present),
	                                       participle(yes),   _)) :-
	     punctuation(S, S0).


The FSTN controller

The controller (or "black box") remains the same as the one given in the previous program, except that it has to accommodate the extra arguments we defined above.

	/* ************************************************ */
	/*                                                  */
	/*   morph/5                                        */
	/*      Arg 1:  State                               */
	/*      Arg 2:  Syntactic category                  */
	/*      Arg 3:  Grammatical features                */
	/*      Arg 4:  Letters in (difference list)        */
	/*      Arg 5:  Letters out (difference list)       */
	/*   Summary: FSA machine.                          */
	/*   Author: P J Hancox                             */
	/*   Date:   9 February 1995                        */
	/*                                                  */
	/* ************************************************ */

	% 1 - terminating condition
	morph(State, Word, Features, S, S0) :-
	     final_state(Final),
	     arc(State, Final, S, S0, Word, Features).
	% 2 - recursive condition
	morph(State, Word, Features, S, S0) :-
	     arc(State, Next, S, S1, Word, Features),
	     morph(Next, Word, Features, S1, S0).

The last two arguments may seem unusual. Of these, the first should be the input to be analysed, presented as a list of characters (including any punctuation symbols. The very last is by convention almost always an empty list, which shows that we want to analyse every letter of the input list. So, examples that might be tried out are:

     | ?- morph(1, Word, Features, [g,i,r,l], []).
     | ?- morph(1, Word, Features, [t,h,e], []).
     | ?- morph(1, Word, Features, [t,r,u,s,t,e,d], []).


A practical patch

You will probably have noticed a sub-goal used in the network called punctuation/2. This represents one of the points at which NLP systems have to descend from the misty heights of abstract linguistic theory and has to concern itself with the messy nature of real input. For most applications that we could imagine, the user will type in some text and it is up to our system to remove extra space and ignore punctuation symbols, when they aren't important.

This is precisely what punctuation/2 does. It gobbles up any spaces and punctuation symbols it can find after the FSTN has found a word. The way it is written is as follows. Any possible punctuation symbols are included in a Prolog fact:

	punctuation_symbols([' ', '.', ',', ';', ':', '?', '!']).
(This should be added to the beginning of the program in the section labelled "constants".)

The code for reading over punctuation symbols has three clauses:

  1. Terminating condition 1
    When the input list is empty, then there is no more punctuation to be read over, so both arguments are empty lists.
    	% 1 - terminating condition - end of list
    	punctuation([], []).
    

  2. Terminating condition 2
    The letters to be analysed are in the first argument: [Char|Chars]. If Char is not a member of the list of punctuation symbols, then there are no more space and punctuation symbols to be gobbled-up.
    	% 2 - terminating condition - no more space/punctuation
    	punctuation([Char|Chars], [Char|Chars]) :-
    	     punctuation_symbols(Symbols),     % retrieve punctuation list
    	     \+ member(Char, Symbols).
    

  3. Recursive condition
    Again, the letters to be analysed are in the first argument: [Char|Chars]. If Char is a member of the list of punctuation symbols, then it is gobbled-up and the rest of the input (in the list Chars) is recursively processed.

    	% 3 - recursive condition - more space/punctuation
    	punctuation([Char|Chars], Chars0) :-
    	     punctuation_symbols(Symbols),     % retrieve punctuation list
    	     member(Char, Symbols),
    	     punctuation(Chars, Chars0).
    

    There remains one more piece of code, for Prologs that don't have an in-built member/2 procedure. This is the standard definition.

    	% 1 - terminating condition
    	member(Elem, [Elem|_]).
    	% 2 - recursive condition
    	member(Elem, [_|Tail]) :-
    	     member(Elem, Tail).
    


    Putting all this together gives us a working morphological analyser, which we will return to when looking at how to integrate morphological and syntactic parsing.


    © P.J.Hancox@bham.ac.uk