What is natural language?

Anyone who has learnt to program has had experience of an artificial language - a language designed for a specific purpose. In the case of the Pascal programming language this is to program computers to perform tasks specified by programmers (and also to teach the elements of good programming). Specifications of Pascal exist which attempt to state the form and range of meanings of a Pascal program.

There is a vocabulary of keywords (e.g. begin, end, if ...), together with rules for forming valid identifiers of the programmer's choice (e.g. identifiers must begin with a letter...).

There is a syntax which states the order in which elements of the vocabulary can combine. For instance this is correct Pascal:

	repeat Count := Count + 1 
	until Count = 10;
but the following, which uses the same vocabulary, is not syntactically correct:
	repeat until Count = 10
	Count := Count + 1;
There is a semantics. That is to say, certain combinations of the vocabulary in a certain order have a recognized meaning:
	if Count > 10 then 
		if Count < 20 then write('Count is less than 20')
			else write('Count is less than 10')
This is a somewhat unfair example, because the meaning is not immediately apparent to the reader, but a correctly written Pascal compiler or interpreter will have no difficulty in assigning the "standard" meaning to this statement, and proving it has done so by performing the correct actions.
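The "standard" meaning here is that an else pairs with the nearest preceding if that has no else of its own. As a rough sketch, the same statement can be restated in Python, whose indentation makes the nesting explicit. The function name nested_if is hypothetical, and returning the message rather than writing it is an assumption made so the result is easy to inspect.

```python
def nested_if(count):
    """Mirror the Pascal statement with the nesting made explicit."""
    if count > 10:
        if count < 20:
            return "Count is less than 20"
        else:
            # The else belongs to the inner test, so this runs when count >= 20.
            return "Count is less than 10"
    # count <= 10: the outer test fails and nothing is written.
    return None


for count in (5, 15, 25):
    print(count, nested_if(count))
```

Under this reading a Count of 25 produces the message "Count is less than 10", which is why the wording of the example misleads a human reader while a compiler is untroubled.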

This doesn't directly answer the question of what natural language is, but it gives us something to contrast it against. Natural language is the language we write and speak in everyday social interaction. There are of course many varieties of natural language (Welsh, Cornish, Celtic, Gaelic, Manx and English are all used in Great Britain and the North of Ireland). It is quite possible to argue that the spoken and the written forms of a language are different and may be largely independent. (If you don't believe this last point, try transcribing a conversation between several people and seeing if it is similar to written language. Radio plays are notoriously dissimilar to real-life drama.)

The claim of many who study natural language is that there are systems of vocabulary, syntax and semantics which can be observed (or similarly discovered) and recorded. Those working in NLP would also claim (or at least hope) that these descriptions can be "automated" to produce useful systems.

If you consider your natural language for a short while, you will be able to think of some elementary rules to describe it: for instance that its words are made up of upper and lower case letters and are bounded by space and punctuation (except in speech, where there seems to be no gap between words). You could go on to make rules for syntax, such as "the" always precedes the name of an object or concept (and you may be able to define "object" and "concept"), and you may well go on to suggest some rules of semantics, such as "her" refers to the female human last appearing in the text (unless of course the text is about non-human animals and/or boats). Well, you should be convinced by now that creating a good description of your language will take more than a few moments and will be less than easy.
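To see how quickly such an elementary rule runs into trouble, here is a minimal sketch of the word rule in Python. It assumes a word is an unbroken run of upper or lower case letters and that everything else is a boundary. The names WORD and words are hypothetical.

```python
import re

# Assumed rule: a word is an unbroken run of upper- or lower-case letters;
# spaces, punctuation and digits all count as boundaries.
WORD = re.compile(r"[A-Za-z]+")


def words(text):
    """Return the words found in text under the assumed rule."""
    return WORD.findall(text)


print(words("The cat sat on the mat."))
print(words("Don't re-enter the O'Neill file."))
```

The first sentence splits as expected, but the second yields fragments such as "Don", "t", "re" and "O", because the rule has nothing to say about apostrophes or hyphens inside words.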

You may consider the whole thing too difficult to be feasible. Don't worry, because there are many NLP systems that aim to use only parts of a natural language, for instance only the kinds of text found in aircraft maintenance manuals. The person who designs the system is creating an artificial language. This new language might be a lot more complicated than Pascal, but it is far less complicated than a full natural language.


What is linguistics?

There is a vulgar view that linguistics is concerned with the definition of the proper and correct form of a language, and thereby with the preservation of the purity of that language. This idea is only too apparent when listening to the complaints of viewers and listeners to broadcasting stations. This view of language is completely rejected by linguists for several reasons, of which the two most important ones are:

The commonly held view among linguists is that a person should speak a version of the language suitable to their background and social position. Therefore no one version of a language can be said to be superior to another.

There are several kinds of linguistics, and these are briefly outlined below.

Comparative linguistics
At one level this is a form of family tree building. It is obvious to people learning several European languages that some are more similar than others and that there may be common roots. It is probably fruitless and futile to try to trace all languages back to one common base language. Another aspect of comparative linguistics is very necessary in the building of machine translation systems. To translate from one language to another it is necessary to know the points of similarity and divergence of languages. For instance in English the adjective usually goes before the noun, whereas in French it is normally in a following position:
the human condition
la condition humaine
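
A point of divergence like this one can be captured as a reordering rule. The following is only a toy sketch in Python: the three-word LEXICON, its part-of-speech tags and the function translate are all hypothetical, and agreement is sidestepped by hard-coding the feminine forms that "condition" requires.

```python
# Hypothetical toy lexicon: English word -> (French word, part of speech).
LEXICON = {
    "the": ("la", "DET"),          # fixed to the feminine article for this example
    "human": ("humaine", "ADJ"),   # feminine form, chosen to agree with "condition"
    "condition": ("condition", "NOUN"),
}


def translate(phrase):
    """Translate word by word, swapping each adjective-noun pair."""
    tagged = [LEXICON[word] for word in phrase.lower().split()]
    output = []
    i = 0
    while i < len(tagged):
        is_adj_noun = (
            i + 1 < len(tagged)
            and tagged[i][1] == "ADJ"
            and tagged[i + 1][1] == "NOUN"
        )
        if is_adj_noun:
            # English adjective + noun becomes French noun + adjective.
            output.append(tagged[i + 1][0])
            output.append(tagged[i][0])
            i += 2
        else:
            output.append(tagged[i][0])
            i += 1
    return " ".join(output)


print(translate("the human condition"))  # la condition humaine
```

A real system would also need rules for gender and number agreement and for the French adjectives that normally precede the noun.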

Structural linguistics
This school is concerned mainly with discovery and enumeration. It attempts to observe all the elements of language, perhaps starting at the lowest level with the single sounds that go to make up the sound of a word. After listing all of these, it could go on to list all possible combinations. There may even be an assumption that the individual sounds in some way represent elements of meaning.

Generative linguistics
The most recent large school of linguistics stresses the idea of language working as a rule system. So the definition of a grammatical sentence is one where the set of rules can be applied to derive its structure. The usefulness of this kind of approach to the computational handling of languages should be obvious, and it is this approach to linguistics that has contributed most to NLP since its revolutionary inception by Noam Chomsky in 1957.
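
The idea of a sentence being grammatical when a set of rules can derive its structure can be sketched in a few lines of Python. The grammar below is a hypothetical toy, not a description of English, and the simple top-down search used here is just one of several ways to apply such rules (it would not terminate on a grammar with left-recursive rules).

```python
# Hypothetical toy grammar: each symbol maps to its alternative expansions.
# Any symbol that is not a key of GRAMMAR is treated as a word.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"], ["V"]],
    "Det": [["the"], ["a"]],
    "N": [["dog"], ["cat"]],
    "V": [["sees"], ["sleeps"]],
}


def derive(symbol, tokens, start):
    """Yield every position at which symbol can end when it begins at start."""
    if symbol not in GRAMMAR:
        # A word matches only itself, consuming one token.
        if start < len(tokens) and tokens[start] == symbol:
            yield start + 1
        return
    for expansion in GRAMMAR[symbol]:
        positions = [start]
        for part in expansion:
            positions = [
                end for pos in positions for end in derive(part, tokens, pos)
            ]
        yield from positions


def is_grammatical(sentence):
    """True if the rules derive S over the whole sentence."""
    tokens = sentence.lower().split()
    return len(tokens) in derive("S", tokens, 0)


print(is_grammatical("the dog sees a cat"))
print(is_grammatical("dog the sees"))
```

The first sentence is accepted because S can be derived across all five words; the second is rejected because no rule allows a noun phrase to begin with "dog".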


© P.J.Hancox@bham.ac.uk