Course 287: Lecture 8 Writing Functions that Create Parsers

Making parser-building functions
How can we understand functions which return functions as results?

Functions Defined in this Lecture.

(mk_builder label) makes a builder for abstract syntax
(mk_parser_singleton toks) makes a parser to recognise 1-element sequences
(mk_parser_seq p1 p2 b) combines two parsers in sequence

Abelson & Sussman

As I mentioned in the previous lecture, starting on page 420 you will find a discussion of parsing in the imperative paradigm rather than the pure functional form used in this lecture. You will find a discussion of Procedures as Returned Values starting at page 72.

Making parser-building functions

When we built the parser for English in the last lecture, we duplicated the text of basic functions, making systematic replacements. This suggests that we can save ourselves some work by generalising, in much the same way as we did when we created map_list in lecture 4 and reduce in lecture 5.

What we really want is a function for making parsers, that is to say, a function which will take some grammatical specifications and give us a parser to match those specifications. However, parsers are functions. We have already seen in lecture 4 how a function can be passed as an argument to another function, but we will now need to define a function which returns a function as its value. Consider:

(define parse_determiner
    (lambda (list_of_tokens)
        (if (member? (car list_of_tokens) '(a the))    ; (1) determiner there?
            (cons_parse                                ; (2) yes! make parse
                (car list_of_tokens)                   ; (3) tree
                (cdr list_of_tokens))                  ; (4) unparsed list
            #f ) ))                                    ; (5) fail

(define parse_noun
    (lambda (list_of_tokens)
        (if (member? (car list_of_tokens) noun)
            (cons_parse (car list_of_tokens) (cdr list_of_tokens))
            #f ) ))

These differ in one place: parse_determiner has the list '(a the) where parse_noun has the non-local variable noun. If we want to generalise, we should create a function that takes a list of tokens and returns a parser which recognises sequences beginning with a member of that list. We have seen that we must write a function that returns a function as its value. We can do this using a lambda expression.

The nature of the generalisation is to go from defining a parser which recognises that a sequence of tokens begins with a noun, to defining a parser which recognises that a sequence of tokens begins with any one of a given class of words (for example nouns, prepositions, adjectives...). To do this we will need to replace the non-local variable noun by a variable which is an argument of a function. The resulting parser-maker will create a parser which recognises a sequence beginning with a token which is any one of a given list, which we shall call "class_of_tokens".

mk_parser_singleton makes a parser for 1-element sequences

So, here is the function, which I call mk_parser_singleton because calling it makes a parser for a language consisting of a finite set of sequences of tokens of length one.

Here (1) the argument of mk_parser_singleton is a "class" of tokens, represented by a list. For example, given '(a the) the result will be a parser which recognises a determiner.

And (2) this function returns a lambda expression, which is a parser for the language. So, having checked whether the input list is null (3), in which case it fails, the parser looks (4) to see if the first element in the list of tokens is a member of the class of tokens. If it is, then (5) it makes a parse-record, consisting (6) of the first element of the list (being trivially a parse-tree specifying what was found) and (7) the remaining elements of the list (being what is left unparsed). If (8) the first element of the list of tokens is not a member of the class of tokens, then the function returns #f to indicate failure.

(define mk_parser_singleton
    (lambda (class_of_tokens)                       ; (1)
        (lambda (list_of_tokens)                    ; (2)
            (cond
                ((null? list_of_tokens) #f)         ; (3)
                ((member? (car list_of_tokens)      ; (4)
                          class_of_tokens)
                 (cons_parse                        ; (5)
                     (car list_of_tokens)           ; (6)
                     (cdr list_of_tokens)))         ; (7)
                (else #f)))))                       ; (8)

We will need the definition of member? from Lecture 5:

(define (member? x list)
     (if (null? list) #f
         (if (equal? x (car list)) #t
              (member? x (cdr list)))))

If we are using UMASS Scheme we can define a record class for parses as follows:

(define class_parse (record-class 'parse '(full full)))

(define cons_parse   (car class_parse))
(define sel_parse    (caddr class_parse))
(define tree_parse   (car sel_parse))
(define rest_parse   (cadr sel_parse))
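
For readers without UMASS Scheme, the same three operations can be approximated with an ordinary two-element list. This is only a sketch: the list representation is our assumption, not part of the record class above, but it keeps the names cons_parse, tree_parse and rest_parse so that the parsers in this lecture can be tried in any standard Scheme.

```scheme
;; Assumed stand-in for the UMASS record class:
;; a parse is represented as a 2-element list (tree rest).
(define (cons_parse tree rest) (list tree rest))
(define (tree_parse p) (car p))    ; what was recognised
(define (rest_parse p) (cadr p))   ; what is left unparsed

(define p (cons_parse 'cat '(eats)))
(tree_parse p)   ; => cat
(rest_parse p)   ; => (eats)
```

Because a parse is always a non-empty list under this representation, it is never confused with the failure value #f, even when nothing is left unparsed.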

Apart from the check for an empty list of tokens, what mk_parser_singleton does then is return the lambda-expression:

(lambda (list_of_tokens)
    (if (member? (car list_of_tokens) class_of_tokens)
        (cons_parse (car list_of_tokens) (cdr list_of_tokens))
        #f ) )

but this is just the same as the body of parse_noun with noun replaced by class_of_tokens. Note that UMASS Scheme, not being an interpreter, actually creates a machine-code block which is equivalent to the lambda-expression above.

    (mk_parser_singleton '(cat dog canary))

prints out as follows:

    <Compiled function: <lambda in mk_parser_singleton > >

Here again is the body of parse_noun:

    (if (member? (car list_of_tokens) noun)
        (cons_parse (car list_of_tokens) (cdr list_of_tokens))
        #f )

We can now use our new function to redo some of our previous grammatical definitions more succinctly:

(define parse_noun
   (mk_parser_singleton '(cat dog child woman man bone cabbage canary)))

(define parse_determiner
   (mk_parser_singleton '(a the)))

(define parse_verb
    (mk_parser_singleton '(likes eats hugs)))

Now, using this new definition to make parse_noun,

we will find that:

(example '(parse_noun '(the fat)) #f)
(example '(parse_noun '(cat eats))
    (cons_parse 'cat '(eats)))
(example '(parse_noun '()) #f)

How can we understand functions which return functions as results?

We can understand what happens using the substitution model for evaluation of Scheme. When we do (mk_parser_singleton '(a the)) we substitute '(a the) for class_of_tokens in the body of mk_parser_singleton, obtaining (leaving aside the check for an empty list):

    (lambda (list_of_tokens)
      (if (member? (car list_of_tokens) '(a the))
          (cons_parse
                 (car list_of_tokens)
                 (cdr list_of_tokens))
          #f ) )

This is then made to be the value of parse_determiner. Likewise, when we create the parser parse_noun we obtain a distinct lambda expression. This model is adequate for understanding Scheme as a functional language.

Warning: this will work fine in Scheme, because the returned lambda expression keeps the value of its free variable class_of_tokens. It will not work in languages which either do not allow functions/procedures to return functions/procedures as results, or which (like C) do allow it but do not handle the free variables correctly.
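
The substitution model can be checked by making two parsers from the same maker and confirming that each one remembers its own class of tokens. The following is a minimal sketch which assumes a plain-list representation of parse-records in place of the UMASS record class; everything else follows the definitions above.

```scheme
;; Assumed portable stand-in: a parse is a 2-element list (tree rest).
(define (cons_parse tree rest) (list tree rest))

(define (member? x lst)
  (cond ((null? lst) #f)
        ((equal? x (car lst)) #t)
        (else (member? x (cdr lst)))))

(define mk_parser_singleton
  (lambda (class_of_tokens)
    (lambda (list_of_tokens)
      (cond ((null? list_of_tokens) #f)
            ((member? (car list_of_tokens) class_of_tokens)
             (cons_parse (car list_of_tokens) (cdr list_of_tokens)))
            (else #f)))))

;; Two calls give two distinct parsers, each with its own class_of_tokens.
(define parse_determiner (mk_parser_singleton '(a the)))
(define parse_verb       (mk_parser_singleton '(likes eats hugs)))

(parse_determiner '(the cat))   ; => (the (cat))
(parse_verb '(the cat))         ; => #f
(parse_verb '(eats the bone))   ; => (eats (the bone))
```

Creating parse_verb does not disturb parse_determiner: each returned function behaves as if its own list had been substituted into the body.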

Combining two parsers sequentially.


Now our other kind of parsing involved combining two existing parsers to obtain a new parser which used the first existing parser to recognise an initial subsequence, and the second existing parser to recognise what remained. We can also write a function which is a generalisation of this process.

Recall our definition of parse_sentence:

(define parse_sentence                                       ;(1)
    (lambda(list_of_tokens)                                  ;(2)
        (let ((p1 (parse_noun_phrase list_of_tokens)))       ;(3) **
            (if p1                                           ;(4)
                (let ((p2 (parse_verb_phrase                 ;(5) **
                             (rest_parse p1))))              ;(6)
                    (if p2                                   ;(7)
                        (cons_parse                          ;(8)
                            (list 'sentence                  ;(9)  **
                                (tree_parse p1)              ;(10)
                                (tree_parse p2))             ;(11)
                            (rest_parse p2)                  ;(12)
                            )                                ;(13)
                        #f)                                  ;(14)
                    ) ;end let                               ;(15)
                #f)                                          ;(16)
            );end let                                        ;(17)
        ); end lambda                                        ;(18)
    ); end def. parse_sentence                               ;(19)

This function can be generalised by making the two parsers parse_noun_phrase [line (3) above] and parse_verb_phrase [line (5) above] parameters of a new function. However, it is also necessary to generalise the building of the parse-tree that is implemented starting at line (9). Clearly, we could generalise on 'sentence by making it a parameter through which the name of the recognised grammatical structure is passed. However, we can obtain greater generality by passing in a builder function, which will create a semantically appropriate representation. The need for this is seen if we consider a fragment of a grammar for a language like Pascal.

     expression -> term + expression

The tree we would want to build to represent an expression such as 2+3 would probably be something like '(+ 2 3), precisely because in writing a compiler it is important to have immediate access to the operation involved, in this case "+". So, in building abstract syntax (a parse tree), we should not be slavishly stuck to the order in which constructs appear in the external syntax.

Thus, we can define mk_parser_seq as (1) taking two parsers parser_1 and parser_2 as parameters, together with a parse-tree builder build_parse.

(define mk_parser_seq
    (lambda (parser_1 parser_2 build_parse)           ;(1)
        (lambda (list_of_tokens)                      ;(2)
            (let ((p1 (parser_1 list_of_tokens)))     ;(3)
                (if p1                                ;(4)
                    (let ((p2 (parser_2               ;(5)
                                 (rest_parse p1))))   ;(6)
                        (if p2                        ;(7)
                            (cons_parse               ;(8)
                                (build_parse          ;(9)
                                    (tree_parse p1)   ;(10)
                                    (tree_parse p2))  ;(11)
                                (rest_parse p2)       ;(12)
                                )                     ;(13)
                            #f)                       ;(14)
                        ) ;end let                    ;(15)
                    #f)                               ;(16)
                );end let                             ;(17)
            ); end lambda                             ;(18)
        ); end lambda                                 ;(19)
    ); end def. mk_parser_seq                         ;(20)
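
To illustrate why a builder function is more general than a label, here is a small sketch that parses a sum and puts the operator first in the tree. The toy grammar sum -> term + term, the names parse_term, parse_plus, parse_plus_term and parse_sum, and the plain-list representation of parse-records are all assumptions made for this example; mk_parser_singleton and mk_parser_seq are as defined in this lecture, written more compactly.

```scheme
;; Assumed portable stand-ins for the UMASS record class.
(define (cons_parse tree rest) (list tree rest))
(define (tree_parse p) (car p))
(define (rest_parse p) (cadr p))

(define (member? x lst)
  (cond ((null? lst) #f)
        ((equal? x (car lst)) #t)
        (else (member? x (cdr lst)))))

(define (mk_parser_singleton class_of_tokens)
  (lambda (list_of_tokens)
    (cond ((null? list_of_tokens) #f)
          ((member? (car list_of_tokens) class_of_tokens)
           (cons_parse (car list_of_tokens) (cdr list_of_tokens)))
          (else #f))))

(define (mk_parser_seq parser_1 parser_2 build_parse)
  (lambda (list_of_tokens)
    (let ((p1 (parser_1 list_of_tokens)))
      (if p1
          (let ((p2 (parser_2 (rest_parse p1))))
            (if p2
                (cons_parse (build_parse (tree_parse p1) (tree_parse p2))
                            (rest_parse p2))
                #f))
          #f))))

;; Hypothetical toy grammar:  sum -> term + term
(define parse_term (mk_parser_singleton '(1 2 3)))
(define parse_plus (mk_parser_singleton '(+)))

;; The inner builder discards the "+" token and keeps the term;
;; the outer builder puts the operator first, as a compiler would want.
(define parse_plus_term
  (mk_parser_seq parse_plus parse_term (lambda (op t) t)))
(define parse_sum
  (mk_parser_seq parse_term parse_plus_term
                 (lambda (t1 t2) (list '+ t1 t2))))

(define result (parse_sum '(2 + 3)))
(tree_parse result)   ; => (+ 2 3)
(rest_parse result)   ; => ()
```

Since mk_parser_seq combines exactly two parsers, the three-part rule is expressed by nesting one sequence inside another. Handling the recursive rule expression -> term + expression would also need a way of choosing between alternatives, which is not covered here.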

mk_builder helps build the abstract syntax tree.

Given mk_parser_seq we are almost ready to redo our entire grammar. We do however need a utility function to build our abstract syntax - this will be passed as the build_parse argument of mk_parser_seq.

(define (mk_builder label)
   (lambda l (cons label l)))
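
To see what a builder made this way does, note that in (lambda l ...) the parameter l is not enclosed in parentheses, so it receives all the arguments of the call collected into one list. The sketch below applies a builder directly; build_np is just an illustrative name.

```scheme
(define (mk_builder label)
  (lambda l (cons label l)))   ; l is the list of all arguments passed

(define build_np (mk_builder 'noun_phrase))   ; build_np: illustrative name

(build_np 'the 'cat)   ; => (noun_phrase the cat)
```

When used with mk_parser_seq the builder is always called with exactly two trees, but nothing in mk_builder itself depends on that number.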

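Now we can redefine our parsers in a few lines of Scheme each.

(define parse_noun_phrase
    (mk_parser_seq parse_determiner parse_noun
            (mk_builder 'noun_phrase)))

(define parse_verb_phrase
    (mk_parser_seq parse_verb parse_noun_phrase
            (mk_builder 'verb_phrase)))

(define parse_sentence
    (mk_parser_seq parse_noun_phrase parse_verb_phrase
            (mk_builder 'sentence)))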

Let us test our new sentence parser:

(example
    '(parse_sentence '(the cat eats the canary))
    (cons_parse
        '(sentence (noun_phrase the cat)        ; parse-tree
            (verb_phrase eats
                (noun_phrase the canary)))       ; end of parse-tree
        '()                                      ; unparsed
        )                                        ; end of parse
    )

If we give it a non-sentence like '(canary the cat eats) the parse fails at the first token, since canary is not a determiner, and we get #f:

    (example '(parse_sentence '(canary the cat eats)) #f)

Error Reporting - An exercise for the reader

Suppose we have an actual user for our parser-makers. That person is soon going to come back saying "These parsers are making my customers unhappy, since if they are given an ungrammatical sentence they just fail, giving no indication of what's wrong". As an exercise, you could adapt the parser-making functions above always to return a parse-record that contains a status report component, #t or #f. If the status is #f, then the rest_parse component of the parse-record contains the unparsed text at the place in which the error was detected. Moreover, another new component will contain a list of the possible tokens that the parser would have accepted as a legal continuation.
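
One possible starting point for this exercise, offered only as a sketch, is to adapt mk_parser_singleton first. The four-component parse-record, represented here as a plain list, and the selector names status_parse and expected_parse are our own assumptions. Adapting mk_parser_seq in the same spirit is left to the reader.

```scheme
;; Assumed representation: a parse is a 4-element list
;; (status tree rest expected). All names here are illustrative.
(define (cons_parse status tree rest expected)
  (list status tree rest expected))
(define (status_parse p)   (car p))
(define (tree_parse p)     (cadr p))
(define (rest_parse p)     (caddr p))
(define (expected_parse p) (cadddr p))

(define (member? x lst)
  (cond ((null? lst) #f)
        ((equal? x (car lst)) #t)
        (else (member? x (cdr lst)))))

(define mk_parser_singleton
  (lambda (class_of_tokens)
    (lambda (list_of_tokens)
      (if (and (not (null? list_of_tokens))
               (member? (car list_of_tokens) class_of_tokens))
          ;; success: nothing to report in the expected component
          (cons_parse #t (car list_of_tokens) (cdr list_of_tokens) '())
          ;; failure: record where we stopped and what we would have accepted
          (cons_parse #f #f list_of_tokens class_of_tokens)))))

(define parse_determiner (mk_parser_singleton '(a the)))
(define bad (parse_determiner '(canary the cat eats)))

(status_parse bad)     ; => #f
(rest_parse bad)       ; => (canary the cat eats)
(expected_parse bad)   ; => (a the)
```

Note that with this design a parser always returns a record, so callers must test status_parse rather than testing the result itself against #f.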