Introduction to AI - Week 4

Neural Nets
Naive Bayes

Neural Networks

Neural Networks have the following features.

A Neuron


A Neuron (Cont'd)


Building a Neural Net

A neural net consists of a set of neurons (input, hidden, and output neurons) and weighted connections between them.

In the simple case, which we will consider in the following, the set of input neurons, the set of output neurons, and the set of hidden neurons are given. Furthermore, the topology is given, that is, the programmer has fixed in advance which neuron is connected to which others.

The weights have to be learned.


A Perceptron

Let's assume we have a number n of input nodes and a single output node, a perceptron. We have to determine the weights w_i such that the output is
  1   if  sum_i w_i x_i >= threshold
  -1  if  sum_i w_i x_i <  threshold

We call this value sign(sum_i w_i x_i). Given a particular setting of the weights and a new training example (a set of inputs x_i with desired output d), we update each weight with a fixed, so-called learning rate lambda:

  w_i -> w_i + lambda * (d - sign(sum_j w_j x_j)) * x_i
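
A minimal sketch of this update rule in Python (the threshold is folded in as a bias weight on a constant input 1, exactly as in the worked example below; the function names are illustrative):

    def sign(net):
        return 1 if net >= 0 else -1

    def predict(weights, inputs):
        # net input: weighted sum over all inputs (the last input is the constant 1)
        return sign(sum(w * x for w, x in zip(weights, inputs)))

    def update(weights, inputs, desired, rate):
        # perceptron learning rule: w_i -> w_i + lambda * (d - sign(net)) * x_i
        error = desired - predict(weights, inputs)
        return [w + rate * error * x for w, x in zip(weights, inputs)]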

Example - Perceptron

 x1    x2    Output
 ----  ----  ------
 1.0   1.0    1
 9.4   6.4   -1
 2.5   2.1    1
 8.0   7.7   -1
 0.5   2.2    1
 7.9   8.4   -1
 7.0   7.0   -1
 2.8   0.8    1
 1.2   3.0    1
 7.8   6.1   -1


Example - Perceptron (Cont'd)

We want to find a way to separate the positive from the negative examples.

Example - Perceptron (Cont'd)

  1. Assume initial (guessed) weights [0.75, 0.5, -0.6], where the third weight acts on a constant third input x_3 = 1, and a learning rate lambda = 0.2.
  2. With the first training example we get:
    f(net_1) = f(0.75*1.0 + 0.5*1.0 - 0.6*1) = f(0.65) = 1, which is the desired output, hence no correction.
  3. f(net_2) = f(0.75*9.4 + 0.5*6.4 - 0.6*1) = f(9.65) = 1, however the result should be -1, so we have to apply the learning rule, i.e., w_i -> w_i + lambda*(d - sign(sum_j w_j x_j))*x_i, that is, w_1 -> 0.75 + 0.2*(-1-1)*9.4 = -3.01, likewise w_2 -> 0.5 + 0.2*(-1-1)*6.4 = -2.06 and w_3 -> -0.6 + 0.2*(-1-1)*1 = -1.00.
  4. This is repeated for all training examples (over several passes through the data) and we will get final weights that define the separating line 1.3*x_1 + 1.1*x_2 - 10.9 = 0. See [Luger, p. 424ff]. A short training-loop sketch on this data follows below.
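
As a sketch (using the data from the table above and lambda = 0.2; the loop cycles over the examples until a full pass makes no correction, which terminates here because the data is linearly separable):

    def sign(net):
        return 1 if net >= 0 else -1

    data = [(1.0, 1.0,  1), (9.4, 6.4, -1), (2.5, 2.1,  1), (8.0, 7.7, -1),
            (0.5, 2.2,  1), (7.9, 8.4, -1), (7.0, 7.0, -1), (2.8, 0.8,  1),
            (1.2, 3.0,  1), (7.8, 6.1, -1)]           # (x1, x2, desired output)

    weights = [0.75, 0.5, -0.6]                       # initial guess, bias weight last
    rate = 0.2                                        # learning rate lambda

    changed = True
    while changed:
        changed = False
        for x1, x2, d in data:
            xs = (x1, x2, 1.0)                        # constant input for the bias weight
            out = sign(sum(w * x for w, x in zip(weights, xs)))
            if out != d:
                weights = [w + rate * (d - out) * x for w, x in zip(weights, xs)]
                changed = True

    print(weights)                                    # weights separating the two classes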

Properties of a Perceptron


Sigmoidal Thresholds


Why Sigmoidal Thresholds?

For learning (training) a network, use backpropagation. Try to minimise the squared error

  Error = ½ * sum_i (d_i - O_i)^2

where d_i is the desired and O_i the computed output for training example i.
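
Backpropagation performs gradient descent on this error, which is why the hard sign threshold is replaced by a smooth, differentiable sigmoid; a minimal sketch (function names are illustrative):

    import math

    def sigmoid(net):
        # smooth, differentiable threshold with output strictly between 0 and 1
        return 1.0 / (1.0 + math.exp(-net))

    def sigmoid_derivative(net):
        # the factor O*(1-O) reappears in the backpropagation error terms below
        o = sigmoid(net)
        return o * (1.0 - o)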


Multi Layered Neural Networks

In order to overcome the limitations of the perceptron, we create more complicated networks.


Three Layered Neural Networks

A simple but already very powerful topology is the three-layered topology, in which there is one input layer, one hidden layer, and one output layer.


Three Layered Neural Networks (Cont'd)

In a typical topology of a three-layered neural network, every input node is connected to every hidden node and every hidden node to every output node. Since the number of input and output nodes is often determined by the problem to be solved, the only decision a designer has to make is how many hidden nodes the net should have. The weights then have to be determined in a so-called training phase of the net.
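
As a sketch, such a fixed topology can be represented by two weight matrices whose shapes are fixed by the chosen layer sizes, while their entries are the values to be learned (the layer sizes below are only illustrative; the small random initial values anticipate the backpropagation procedure further down):

    import random

    n_input, n_hidden, n_output = 3, 4, 2    # illustrative layer sizes

    # one weight per connection: hidden-from-input and output-from-hidden
    w_hidden = [[random.uniform(-0.05, 0.05) for _ in range(n_input)]
                for _ in range(n_hidden)]
    w_output = [[random.uniform(-0.05, 0.05) for _ in range(n_hidden)]
                for _ in range(n_output)]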

Backpropagation

Since we have multiple outputs, we want to minimise

  Error = ½ * sum_{i in T} sum_{k in Outputs} (d_{i,k} - O_{i,k})^2

where T is the set of training examples, d_{i,k} is the desired output and O_{i,k} the computed output of output node k on example i.

Backpropagation Approach

  1. Initialise the weights randomly with small numbers (e.g., between -0.05 and 0.05).
  2. Until the termination condition is reached, repeat for each training example x:
    1. Take a training example x and compute the corresponding outputs.
    2. For each output node k calculate its error term
      delta_k = lambda * O_k * (1 - O_k) * (d_k - O_k)
    3. For each hidden node h calculate its error term
      delta_h = O_h * (1 - O_h) * sum_{k in Outputs} w_{k,h} * delta_k
    4. Update the weights w_{j,i} -> w_{j,i} + delta w_{j,i} with
      delta w_{j,i} = delta_j * x_{j,i}
      where x_{j,i} is the i-th input to node j (see the sketch below).
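
A minimal sketch of this procedure for a three-layered net in Python (layer sizes, the fixed number of passes used as termination condition, and all names are illustrative; the learning rate lambda is folded into the output error term exactly as in the formulas above):

    import math, random

    def sigmoid(net):
        return 1.0 / (1.0 + math.exp(-net))

    def train(examples, n_input, n_hidden, n_output, rate=0.1, epochs=1000):
        # examples: list of (input vector, desired output vector) pairs
        # 1. initialise all weights with small random numbers
        w_h = [[random.uniform(-0.05, 0.05) for _ in range(n_input)] for _ in range(n_hidden)]
        w_o = [[random.uniform(-0.05, 0.05) for _ in range(n_hidden)] for _ in range(n_output)]
        for _ in range(epochs):                       # 2. until termination condition reached
            for xs, ds in examples:
                # 2.1 forward pass: hidden and output activations
                hidden = [sigmoid(sum(w * x for w, x in zip(row, xs))) for row in w_h]
                output = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_o]
                # 2.2 error terms of the output nodes
                delta_o = [rate * o * (1 - o) * (d - o) for o, d in zip(output, ds)]
                # 2.3 error terms of the hidden nodes
                delta_h = [h * (1 - h) * sum(w_o[k][j] * delta_o[k] for k in range(n_output))
                           for j, h in enumerate(hidden)]
                # 2.4 weight updates: delta w_{j,i} = delta_j * x_{j,i}
                for k in range(n_output):
                    for j in range(n_hidden):
                        w_o[k][j] += delta_o[k] * hidden[j]
                for j in range(n_hidden):
                    for i in range(n_input):
                        w_h[j][i] += delta_h[j] * xs[i]
        return w_h, w_o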

Intuitive Explanation for the Updates


Naive Bayes Approach

Let a list of attribute values [a_1, a_2, ..., a_n] be given, e.g., [Outlook=Sunny, Humidity=High, Wind=Strong]. The task is to predict a target concept PlayTennis in the form of a yes/no decision.
 Day   Outlook   Temperature  Humidity  Wind    PlayTennis
 ----  --------  -----------  --------  ------  ----------
 D1    Sunny     Hot          High      Weak    No
 D2    Sunny     Hot          High      Strong  No
 D3    Overcast  Hot          High      Weak    Yes
 D4    Rain      Mild         High      Weak    Yes
 D5    Rain      Cool         Normal    Weak    Yes
 D6    Rain      Cool         Normal    Strong  No
 D7    Overcast  Cool         Normal    Strong  Yes
 D8    Sunny     Mild         High      Weak    No
 D9    Sunny     Cool         Normal    Weak    Yes
 D10   Rain      Mild         Normal    Weak    Yes
 D11   Sunny     Mild         Normal    Strong  Yes
 D12   Overcast  Mild         High      Strong  Yes
 D13   Rain      Hot          Normal    Weak    Yes
 D14   Rain      Mild         High      Strong  No

Naive Bayes Approach (Cont'd)

  1. Approach: Estimate the probabilities P(v_j) and P(a_i|v_j) from the frequencies in the training set.
  2. Naively assume that the attribute values are conditionally independent given the target value, i.e., we can compute P(a_1, a_2, ..., a_n | v_j) as the product P(a_1|v_j) * P(a_2|v_j) * ... * P(a_n|v_j).
  3. The classifier outputs the most probable target value, that is, given the attribute values [a_1, a_2, ..., a_n], it takes
    v_max = argmax_{v_j in V} P(v_j | a_1, a_2, ..., a_n)
  4. With Bayes' theorem we get
    v_max = argmax_{v_j in V} P(a_1, a_2, ..., a_n | v_j) * P(v_j) / P(a_1, a_2, ..., a_n) = argmax_{v_j in V} P(a_1, a_2, ..., a_n | v_j) * P(v_j)
    since the denominator does not depend on v_j.
  5. With the assumption in (2) we get
    v_max = argmax_{v_j in V} P(v_j) * P(a_1|v_j) * P(a_2|v_j) * ... * P(a_n|v_j)
    (a sketch of this classifier in code follows below).
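
A minimal sketch of this classifier in Python, estimating all required probabilities from frequency counts (the names are illustrative; zero counts are not smoothed, just as in the lecture):

    from collections import Counter

    def train_naive_bayes(examples):
        # examples: list of (attribute value list, target value) pairs
        class_counts = Counter(v for _, v in examples)
        # how often attribute position i takes value a among examples of class v
        attr_counts = Counter((i, a, v) for attrs, v in examples
                                        for i, a in enumerate(attrs))
        return class_counts, attr_counts, len(examples)

    def classify(model, attrs):
        class_counts, attr_counts, n = model
        best_value, best_score = None, -1.0
        for v, count in class_counts.items():
            score = count / n                              # P(v_j)
            for i, a in enumerate(attrs):
                score *= attr_counts[(i, a, v)] / count    # P(a_i | v_j)
            if score > best_score:
                best_value, best_score = v, score
        return best_value

Applied to the PlayTennis table, this chooses PlayTennis = No for the new instance classified below, in agreement with the computation there.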

Naive Bayes Approach (Cont'd)

All that is needed are the values P(ai|vj) and P(vj), which are estimated from the training examples. In the example:
 P(PlayTennis = Yes)                  =  9/14  =  0.64
 P(PlayTennis = No)                   =  5/14  =  0.36
 P(Wind = Strong | PlayTennis = Yes)  =  3/9   =  0.33
 P(Wind = Strong | PlayTennis = No)   =  3/5   =  0.60

Assume we see a new instance:
 Day   Outlook  Temperature  Humidity  Wind    PlayTennis
 ----  -------  -----------  --------  ------  ----------
 new   Sunny    Cool         High      Strong  ???

We compute:

P(Yes) * P(Sunny|Yes) * P(Cool|Yes) * P(High|Yes) * P(Strong|Yes) = 0.0053
P(No) * P(Sunny|No) * P(Cool|No) * P(High|No) * P(Strong|No) = 0.0206

Thus Naive Bayes chooses the target value with the higher score, i.e., PlayTennis = No in this case.
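
The two products can be checked with the conditional probabilities read off the table in the same way as the Wind probabilities above (a short arithmetic check; the fractions are the frequency estimates for the Yes and No cases):

    p_yes = 9/14 * 2/9 * 3/9 * 3/9 * 3/9    # P(Yes)*P(Sunny|Yes)*P(Cool|Yes)*P(High|Yes)*P(Strong|Yes)
    p_no  = 5/14 * 3/5 * 1/5 * 4/5 * 3/5    # P(No)*P(Sunny|No)*P(Cool|No)*P(High|No)*P(Strong|No)
    print(round(p_yes, 4), round(p_no, 4))  # 0.0053 0.0206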


Example - Text Analysis


Summary


Further Reading    



© Manfred Kerber, 2004, Introduction to AI
24.4.2005
The URL of this page is http://www.cs.bham.ac.uk/~mmk/Teaching/AI/Teaching/AI/l4.html.
URL of module http://www.cs.bham.ac.uk/~mmk/Teaching/AI/