Learning in TLUs

Introduction

In this tutorial you will learn about:

Linear separability

The simplest case of a linearly separable decision problem is one consisting of two sets of points (patterns) in a 2-d vector space that belong to different classes, where the two classes can be separated by a straight line. In 3 dimensions the equivalent problem is one where the points are separable by a plane.

We can extend this idea to any number of dimensions, though 4 or 5 dimensions is harder to visualise! Planes and straight lines are just hyperplanes of dimension 1 and 2. There are hyperplanes of higher dimension. In higher dimensional cases two classes in an N dimensional space are termed "linearly separable" if they are separable by a hyperplane of dimension N-1.

More precisely a linearly separable decision problem is one for which there exists a solution of the form

\begin{displaymath}y = \sum_{i=1}^{n} \vec{w}_{i} \, \vec{x}_{i} - h
\end{displaymath} (3)

The solution is simply the combination of weights and threshold which minimises the error function that has been defined. The error function can be defined in a number of ways. In the simplest case the error could simply be the difference between the actual value of a for each pattern x and the desired value of a . In a TLU a simple way of defining the error is by thresholding a , and defining the output of the threshold function as the classification of the input pattern. So our single TLU can distinguish between two classes. The error measure now could simply be the number of incorrectly classified patterns.

So we can see that a TLU implements a linear decision boundary. In the previous tutorial we saw that by changing the weights and the threshold of the TLU we can move the decision boundary. For the TLU to solve a decision problem all the patterns from one class should be on one side of the decision boundary, and all the patterns from the other class should be on the other side.

Learning in a TLU is concerned with using an automatic procedure for adjusting the weights and threshold so that the decision boundary minimises the error function.

Linear inseparability

Clearly not all decision problems are linearly separable: they cannot be solved using a linear decision boundary. Problems like these are termed linearly inseparable. XOR is a linearly inseparable problem.

If you don't understand the concept of linear separability and inseparability try plotting points in the input space in the demonstration below. Then try and separate the two classes by altering the position of the decision boundary by hand.

The Perceptron learning rule

Now we know what sort of problem can be solved using a TLU let's think about how a TLU can learn to solve such a problem. A simple learning procedure for a TLU is called the Perceptron learning rule. First of all we need to define an error function precisely. The error function used for the perceptron training rule is based on the number of incorrectly classified points in the training set.

\begin{displaymath}E = - \sum_{x \in M} d_{x} \, \vec{w} . \, \vec{x}
\end{displaymath} (4)

where M is the set of misclassified input patterns, and d is the desired output of the TLU. This is the perceptron error criterion. If all input vectors are correctly classified the error E=0. The contribution of a misclassified input pattern to the error is its absolute distance from the decision boundary. The perceptron learning procedure minimises this error function. The learning procedure is as follows:

Where Eqs.(5) and (6) are

\begin{displaymath}\vec{w}_{i} = \vec{w}_{i} + \alpha \vec{x}_{i}
\end{displaymath} (5)

\begin{displaymath}\vec{w}_{i} = \vec{w}_{i} - \alpha \vec{x}_{i}
\end{displaymath} (6)

The training set is simply the set of input patterns that the TLU is learning to classify correctly. The value of alpha is in the interval (0,1], and is often set to 1. The algorithm will converge to a correct solution for any positive value of alpha if the problem is linearly separable. This procedure can be used to adjust the threshold as well by turning the threshold into another weight with a input value of -1. This is the way it is implemented in the demonstration below.

To see how the Perceptron learning rule works use the Demonstration. Set the training regime to be incremental. This means that the weights and threshold will be updated after each input pattern. The batch training regime means that weight changes are saved up and only made at the end of an epoch. An epoch is a complete presentation of all the input patterns in the training set. The error signals for a sequence of patterns will be different for the two training regimes. Can you explain why?

The LMS learning rule

The Least Mean Squares rule minimises an error measure called the mean squared error.

\begin{displaymath}E = \frac{1}{2} \sum_{x \in M} e_{x}^{2}
\end{displaymath} (7)

Where e_x is the error for the input pattern x . In the TLU implementation of this rule the error for each pattern is calculated by finding the difference between the desired and actual activation, i.e. the error is calculated using the activation of the TLU before it is passed through the threshold function.

\begin{displaymath}{\bf e}_{x} = {\bf d}_{x} - {\bf a}_{x}
\end{displaymath} (8)

Where d_x is the desired output (+1 or -1) and a_x is the actual activation. The update procedure is simple. The weights are updated for every input pattern using the rule

\begin{displaymath}\vec{w}_{i} = \vec{w}_{i} + \alpha \, e_{x} \, \vec{x}_
{i}
\end{displaymath} (9)

If the weights are updated after every input pattern then the decision boundary will continue to move around the point with the lowest error. If batch training is used then the procedure will settle on the solution providing the minimum error. Strictly both the perceptron learning rules and the LMS rule were only designed to operate incrementally. In this tutorial we've added batch versions to help you understand the way in the which the original (incremental) rules work.

Demonstration

Exercises

  1. Place two clusters of red and blue points in two diagonally opposite corners of the input space. Now setting the training regime to incremental, learn using the Perceptron rule. Record the final position of the weight vector. Repeat your experiment from several different starting positions. Why do the finishing positions of the decision boundary differ? When does the TLU stop learning. Why?

  2. Now move the weights so that the decision boundary no longer correctly classifies all the exemplars. Apply the LMS rule, again using the incremental training regime. You will have to gradually decline the learning rate to obtain convergence. What is the final position of the weight vector? Reset the weights and repeat the experiment. Why does the LMS rule keep on adjusting the weights even after the patterns appear to be correctly classified?

  3. Explain why the final position of the decision boundary differs between the perceptron learning rule and the LMS rule. What does this tell you about the likely generalisation abilities of the two rules?

  4. Now run the batch version of the LMS rule on the same data. Explain why the decision boundary now settles to a steady position.

  5. Explain why the LMS rule is unstable if the learning rate is set high.

  6. Now devise an experiment to compare the different convergence times of the rules. Write up your experiment, reporting the aim, method, results and your conclusions.

  7. Now reset the training data and create two classes that are linearly separable but close to one another. Devise experiments to compare the behaviour of the two learning rules in this situation. Make sure you examine the behaviour of each rule under each training regime (incremental and batch).

  8. Now devise a linearly inseparable set of training data. Devise experiments to investigate the behaviour of the learning rules now. Explain why the perceptron learning rule doesn't allow the decision boundary to settle, whereas the LMS rule does (in the batch case), but to an incorrect classification.