Research topics in NLP

In the section on the state-of-the-art of NLP, I wrote:
"It would be incorrect to think that the state-of-the-art has solved many of the problems of NLP. Progress has been made and we understand better some of the limitations of the methods we have been using. Research will move on to investigate other promising techniques, formalisms and application areas. What these will be is a matter of predicting the future..."
Predicting the future has the disadvantage that one can soon be proved wrong. In this section I take the risk and predict that the research of the near future will look at the following topics.

Ill-formed input
Humans are very good a making sensse of inputt poor, whatefer reason being for the incorrectness of the language. The sentence you have just read includes several obvious slips which make it seem strange: some grammatical (eg whatefer reason being should read as whatever the reason being); some because the word-order is "incorrect" (eg inputt poor should read as poor input); and some because there are spelling errors (eg sensse). Computer-based systems have generally been very poor at processing ill-formed input. As soon as the input contains the kinds of errors illustrated above, it is rejected as unsatisfactory. There have been attempts to construct more robust systems, but these efforts have generally led to slow systems that introduce inaccuracies into the analysis of the text.
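As a small illustration of one kind of robustness, the sketch below repairs spelling slips by replacing an unknown word with its closest entry in a lexicon. It is only a sketch under stated assumptions: the lexicon, the function name normalise and the similarity cutoff are all invented for this example, and the matching uses Python's standard difflib module rather than any technique described in these notes. It does nothing for word-order or grammatical errors.

```python
from difflib import get_close_matches

# Hypothetical toy lexicon; a real system would need a far larger one.
LEXICON = ["humans", "are", "very", "good", "at", "making", "sense",
           "of", "poor", "input", "whatever", "the", "reason"]


def normalise(tokens, lexicon=LEXICON, cutoff=0.8):
    """Replace each unknown token with its closest lexicon entry, if any.

    The cutoff (an assumed value) is the minimum similarity score that
    difflib must report before a replacement is accepted.
    """
    repaired = []
    for token in tokens:
        word = token.lower()
        if word in lexicon:
            repaired.append(word)
            continue
        matches = get_close_matches(word, lexicon, n=1, cutoff=cutoff)
        # Keep the original word when nothing in the lexicon is close enough.
        repaired.append(matches[0] if matches else word)
    return repaired


print(normalise("making sensse of inputt poor".split()))
```

Note the trade-off mentioned above: every unknown word is compared against the whole lexicon, which is slow for a large vocabulary, and a lower cutoff accepts more repairs at the risk of introducing wrong words into the analysis.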

Distributed and parallel models
On reflection (or introspection) most people would agree that their brains work on language in several ways, in parallel. Most obvious is the claim that humans perform a partial syntactic analysis of the beginning of the language they are hearing and then begin semantic analysis without waiting for the end of the sentence. This intuition is backed up by psycholinguistic studies, which show that people commit themselves to a particular semantic interpretation quite early in the sentence, and this guides the syntactic analysis of the rest of the sentence.

One obvious growth area, then, will be the development of parallel models that use a variety of linguistic information. This will involve the development of appropriate data structures to represent the interconnection of differing types of linguistic information, and the development of algorithms for finding and applying the appropriate knowledge at the right times during processing, without the system becoming deadlocked.

Understanding paragraphs and longer texts
Present systems largely work by analysing one sentence at a time. However, humans usually interact with each other using more extensive "texts". For instance, a phone conversation to order a pizza will require several sentences. The information to be collected (ie summarized) from such a dialogue may be quite simple, and could be recorded on a simple form. By analogy, the form could be represented in the computer by quite a simple data structure, for instance a frame or semantic net.
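The pizza-order form can be sketched as a frame whose slots are filled in gradually over several sentences. This is a minimal sketch under stated assumptions: the slot names, the small vocabularies and the "deliver to" marker are all hypothetical, and the naive keyword spotting stands in for the real sentence analysis that a working system would need.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical vocabularies, chosen only for this illustration.
SIZES = {"small", "medium", "large"}
TOPPINGS = {"mushroom", "ham", "olives", "pepperoni"}
ADDRESS_MARKER = "deliver to "


@dataclass
class PizzaOrderFrame:
    """A frame for one order; each attribute is a slot on the 'form'."""
    size: Optional[str] = None
    toppings: List[str] = field(default_factory=list)
    address: Optional[str] = None

    def missing_slots(self):
        """Names of the slots that the dialogue has not yet filled."""
        return [name for name, value in vars(self).items() if not value]


def update_frame(frame, utterance):
    """Fill slots from one utterance by naive keyword spotting."""
    lowered = utterance.lower()
    for word in lowered.replace(",", " ").replace(".", " ").split():
        if word in SIZES:
            frame.size = word
        elif word in TOPPINGS and word not in frame.toppings:
            frame.toppings.append(word)
    start = lowered.find(ADDRESS_MARKER)
    if start != -1:
        # Everything after the marker is taken to be the address.
        frame.address = utterance[start + len(ADDRESS_MARKER):].strip(" .")
    return frame


order = PizzaOrderFrame()
for sentence in ["I'd like a large pizza, please.",
                 "With mushroom and ham.",
                 "Please deliver to 12 High Street."]:
    update_frame(order, sentence)
print(order, order.missing_slots())
```

The point of the design is that no single sentence fills the form: the frame persists across the whole dialogue, and missing_slots shows what the system would still need to ask about.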

Understanding more complex texts, dialogues, etc is more difficult. When you read these notes, you are not merely storing a linguistic representation of each sentence in turn, but are assimilating the information/knowledge into what you already know about the subject. This process probably also includes compressing information and discarding irrelevant text. As yet, computers are incapable of this kind of assimilation. However, when they are able to do so, there will be benefits from systems that can learn and re-represent information for other purposes. As a brief example, perhaps you would like a program that could read through these notes and produce, for each section, a summary of the kind of material that you should learn for exam revision?


© P.J.Hancox@bham.ac.uk