Introduction to AI - Week 4
Neural Nets
Naive Bayes
- The human brain can be viewed (in a very simplified model) as a network of neurons.
- Simple probabilistic reasoning with strong assumptions can lead to powerful predictions.
Neural Networks
Neural Networks have the following features.
- They are inspired by nature. The human mental capabilities are generated by the brain, which can be viewed as a large set of neurons (10^{10} to 10^{11} of them), each connected to around 10^{3} to 10^{4} others.
- Unlike expert systems or decision trees, neural nets do not make use of an explicit symbolic representation.
- Each individual neuron behaves in a very simple way.
A Neuron
A Neuron (Cont'd)
- A neuron receives inputs from other neurons, x_{1}, x_{2}, ..., x_{n}, which are weighted by weights w_{1}, w_{2}, ..., w_{n}.
- In the simplest model, a neuron takes its weighted inputs and sums them up: S = sum_{i=1}^{n} w_{i} * x_{i}.
- Then it checks whether the sum is above a given threshold: S >= Threshold.
- If yes, it sends out the value 1 (excitatory).
- If not, it sends out the value -1 (inhibitory).
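The behaviour of this simple neuron model can be sketched as follows (a minimal illustration; the function name and example values are mine, not from the slides):

```python
def neuron_output(inputs, weights, threshold):
    """Weighted sum of the inputs, then a hard threshold:
    1 (excitatory) if the sum reaches the threshold, -1 (inhibitory) otherwise."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s >= threshold else -1
```

For example, with weights [0.5, 0.5] and threshold 1, the inputs [1, 1] give output 1, while [1, 0] give output -1.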
Building a Neural Net
A neural net consists of
- a set of input neurons, a set of output neurons, and a set of hidden neurons,
- a set of directed links, each connecting two neurons,
- a set of weights (one for each link).
In the simple case, which we will consider in the following, the sets of input, output, and hidden neurons are given. Furthermore, the topology is given, that is, the programmer has fixed initially which neuron is connected to which others. The weights have to be learned.
A Perceptron
Let's assume we have a number n of input nodes and a single output node: a perceptron. We have to determine the weights such that the output is
  1  if sum_{i} w_{i} * x_{i} >= threshold
 -1  if sum_{i} w_{i} * x_{i} < threshold
We call this value sign(sum_{i} w_{i} * x_{i}).
When a particular setting of weights is given and we get a new training example (set of inputs), we do the following with a fixed so-called learning rate lambda:
- If the desired output and the actual output values are equal, do nothing.
- If the actual output is -1, but should be 1, increase the weights, that is, w_{i} -> w_{i} + delta w_{i} with delta w_{i} = lambda * (d_{i} - O_{i}) * x_{i}.
- If the actual output is 1, but should be -1, decrease the weights, using the same rule w_{i} -> w_{i} + delta w_{i} with delta w_{i} = lambda * (d_{i} - O_{i}) * x_{i} (now a negative correction).
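The learning rule can be sketched as a single update step (an illustrative sketch; the threshold is assumed to be folded in as a weight on a constant input of 1):

```python
def sign(s):
    # Perceptron output: 1 if the weighted sum reaches 0, else -1.
    return 1 if s >= 0 else -1

def perceptron_update(weights, x, desired, rate):
    """One application of the learning rule w_i -> w_i + rate * (d - O) * x_i.
    If desired and actual output agree, d - O = 0 and nothing changes;
    otherwise all weights move in the direction that corrects the error."""
    actual = sign(sum(w * xi for w, xi in zip(weights, x)))
    return [w + rate * (desired - actual) * xi for w, xi in zip(weights, x)]
```

Note that all three cases of the rule are covered by the single formula: (d - O) is 0 when the outputs agree, +2 when the output should rise, and -2 when it should fall.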
Example - Perceptron

 x_1 | x_2 | Output
-----+-----+-------
 1.0 | 1.0 |   1
 9.4 | 6.4 |  -1
 2.5 | 2.1 |   1
 8.0 | 7.7 |  -1
 0.5 | 2.2 |   1
 7.9 | 8.4 |  -1
 7.0 | 7.0 |  -1
 2.8 | 0.8 |   1
 1.2 | 3.0 |   1
 7.8 | 6.1 |  -1
Example - Perceptron (Cont'd)
We want to find a way to separate the positive from the negative examples.
Example - Perceptron (Cont'd)
- Assume initial (guessed) weights [0.75, 0.5, -0.6], where the third weight acts on a constant bias input of 1, and a learning rate lambda = 0.2.
- With the first example we get: f(net)^{1} = f(0.75*1.0 + 0.5*1.0 - 0.6*1) = 1, hence no correction.
- f(net)^{2} = f(0.75*9.4 + 0.5*6.4 - 0.6*1) = 1, however the result should be -1, so we have to apply the learning rule, i.e., w_{i} -> w_{i} + lambda*(d - sign(sum_{i} w_{i}*x_{i}))*x_{i}, that is, w_{1} -> 0.75 + 0.2*(-1-1)*9.4 = -3.01, likewise w_{2} = -2.06 and w_{3} = -1.00.
- This will be repeated for all input values, and we will get final weights satisfying 1.3*x_{1} + 1.1*x_{2} - 10.9 = 0 as the separating line.
See [Luger, p. 424ff].
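The weight correction in the second step can be checked numerically (a small sketch; the third weight acts on a constant bias input of 1, so the threshold is learned along with the other weights):

```python
def sign(s):
    return 1 if s >= 0 else -1

# Initial weights [w1, w2, bias weight] and learning rate from the example.
w = [0.75, 0.5, -0.6]
rate = 0.2

# Second training example: x1 = 9.4, x2 = 6.4, bias input 1, desired output -1.
x, d = [9.4, 6.4, 1.0], -1
actual = sign(sum(wi * xi for wi, xi in zip(w, x)))  # net = 9.65, so actual = 1
w = [wi + rate * (d - actual) * xi for wi, xi in zip(w, x)]
# w is now approximately [-3.01, -2.06, -1.0]
```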
Properties of a Perceptron
- The learning rate lambda should not be too big (you will miss the true value) or too small (learning will be very slow). It is usual to decrease the learning rate slowly over time.
- All that can be learned is a straight line to separate positive from negative examples. But what if the positive and negative examples cannot be separated by a straight line?
- Use more sophisticated threshold functions.


[Figure: hard-limiting (bipolar) threshold and sigmoidal (unipolar) threshold functions]
Sigmoidal Thresholds
- First the weighted inputs are summed up: net = sum_{i=1}^{n} w_{i} * x_{i}.
- Second the threshold function is applied: output = sigma(net) = 1 / (1 + e^{-lambda*net}).
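The two steps can be sketched in code (illustrative; `lam` stands for the steepness parameter lambda):

```python
import math

def sigmoid_output(inputs, weights, lam=1.0):
    """Sum the weighted inputs, then apply the sigmoid 1/(1 + e^(-lambda*net))."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-lam * net))
```

For large lam the sigmoid approaches a hard 0/1 step at net = 0, while small lam gives a smooth, differentiable transition.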
Why Sigmoidal Thresholds?
- Nice theoretical properties (differentiable).
- Can approximate the hard-limiting threshold function well for large lambda.
For learning (training) a network, use backpropagation. Try to minimise the squared error
Error = 1/2 * sum_{i} (d_{i} - O_{i})^{2}
Multi Layered Neural Networks
In order to overcome the limitations of the perceptron, create more
complicated networks.
- The perceptron can be viewed as a network of two layers, the input layer and the output layer.
- Any node which is neither an input nor an output node is called a hidden node.
- It is possible to create networks with many hidden nodes.
- A standard approach is a layered topology, in which all nodes are partitioned into layers and each layer n feeds into the next layer n+1 only.
Three Layered Neural Networks
A simple, but already very powerful topology of nets is the three
layered topology, in which there is one input layer, one hidden layer,
and one output layer.
Three Layered Neural Networks (Cont'd)
A typical topology of a three-layered neural network is:
- All input nodes are connected with all nodes in the hidden layer.
- All nodes in the hidden layer are connected with the output layer.
- No further nodes are connected.
Since the number of input and output nodes is often determined by the problem to be solved, the only decision a designer has to make is how many hidden nodes the net should have. The weights then have to be determined in a so-called training phase of the net.
Backpropagation
Since we have multiple outputs, we want to minimise
Error = 1/2 * sum_{i in T} sum_{k in Outputs} (d_{i,k} - O_{i,k})^{2}
with:
- T the set of training examples (given in the form of an input list and an output list),
- d the desired output,
- O the actual output.
Backpropagation Approach
- Initialise the weights randomly with small numbers (e.g. between -0.05 and 0.05).
- Until the termination condition is reached, repeat for each training example x:
  - Take the training example x and compute the corresponding outputs.
  - For each output node k calculate its error term delta_{k} = lambda * O_{k} * (1 - O_{k}) * (d_{k} - O_{k}).
  - For each hidden node h calculate its error term delta_{h} = O_{h} * (1 - O_{h}) * sum_{k in Outputs} w_{k,h} * delta_{k}.
  - Update the weights: w_{j,i} -> w_{j,i} + delta w_{j,i} with delta w_{j,i} = delta_{j} * x_{j,i}.
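The steps above can be sketched for a three-layer net (an illustrative sketch, not a full implementation: lambda is folded into the learning rate, and the lists of weight vectors are assumed to match the layer sizes):

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def backprop_step(x, d, w_hidden, w_out, rate):
    """One backpropagation step. w_hidden[j] holds the weights into hidden
    node j, w_out[k] the weights into output node k. Returns the outputs
    computed in the forward pass (before the weight update)."""
    # Forward pass: hidden activations, then outputs.
    h = [sigmoid(sum(wj[i] * x[i] for i in range(len(x)))) for wj in w_hidden]
    o = [sigmoid(sum(wk[j] * h[j] for j in range(len(h)))) for wk in w_out]
    # Error terms for output nodes: delta_k = O_k * (1 - O_k) * (d_k - O_k).
    delta_o = [ok * (1 - ok) * (dk - ok) for ok, dk in zip(o, d)]
    # Error terms for hidden nodes: propagate output deltas back through the weights.
    delta_h = [h[j] * (1 - h[j]) * sum(w_out[k][j] * delta_o[k] for k in range(len(o)))
               for j in range(len(h))]
    # Weight updates: delta w_{j,i} = delta_j * x_{j,i} (scaled by the learning rate).
    for k, wk in enumerate(w_out):
        for j in range(len(h)):
            wk[j] += rate * delta_o[k] * h[j]
    for j, wj in enumerate(w_hidden):
        for i in range(len(x)):
            wj[i] += rate * delta_h[j] * x[i]
    return o
```

Repeating this step over all training examples until the squared error is small enough constitutes the training phase mentioned above.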
Intuitive Explanation for the Updates
- [(b)] delta_{k} = lambda * O_{k} * (1 - O_{k}) * (d_{k} - O_{k})
The change of the weights to the output nodes should be proportional to the difference between desired and actual output, d_{k} - O_{k}, multiplied with the derivative of the sigmoid function. Note, with sigma(x) = 1 / (1 + e^{-lambda*x}) you get sigma'(x) = lambda * e^{-lambda*x} / (1 + e^{-lambda*x})^{2}, that is, sigma'(x) = lambda * sigma(x) * (1 - sigma(x)).
- [(c)] delta_{h} = O_{h} * (1 - O_{h}) * sum_{k in Outputs} w_{k,h} * delta_{k}
The value for a hidden node is updated similarly. However, since there is no error value which directly tells how to update it, the weighted average of its contribution to the errors is taken, where w_{k,h} is the weight between the hidden node h and the output node k.
Naive Bayes Approach
Let a list of attribute values [a_{1}, a_{2}, ..., a_{n}] be given, e.g., [Outlook=Sunny, Humidity=High, Wind=Strong]. The task is to predict a target concept PlayTennis in the form of a yes/no decision.
Day | Outlook  | Temperature | Humidity | Wind   | PlayTennis
----+----------+-------------+----------+--------+-----------
D1  | Sunny    | Hot         | High     | Weak   | No
D2  | Sunny    | Hot         | High     | Strong | No
D3  | Overcast | Hot         | High     | Weak   | Yes
D4  | Rain     | Mild        | High     | Weak   | Yes
D5  | Rain     | Cool        | Normal   | Weak   | Yes
D6  | Rain     | Cool        | Normal   | Strong | No
D7  | Overcast | Cool        | Normal   | Strong | Yes
D8  | Sunny    | Mild        | High     | Weak   | No
D9  | Sunny    | Cool        | Normal   | Weak   | Yes
D10 | Rain     | Mild        | Normal   | Weak   | Yes
D11 | Sunny    | Mild        | Normal   | Strong | Yes
D12 | Overcast | Mild        | High     | Strong | Yes
D13 | Rain     | Hot         | Normal   | Weak   | Yes
D14 | Rain     | Mild        | High     | Strong | No
Naive Bayes Approach (Cont'd)
1. Approach: Estimate the probabilities P(v_{j}) and P(a_{i} | v_{j}) from the frequencies in the training set.
2. Naively assume that the probabilities are all (conditionally) independent, i.e., we can compute P(a_{1}, a_{2}, ..., a_{n} | v_{j}) as the product P(a_{1} | v_{j}) * P(a_{2} | v_{j}) * ... * P(a_{n} | v_{j}).
3. The classifier returns the most probable target value, that is, given the attribute values [a_{1}, a_{2}, ..., a_{n}], take v_{max} = argmax_{v_{j} in V} P(v_{j} | a_{1}, a_{2}, ..., a_{n}).
4. With Bayes' theorem we get v_{max} = argmax_{v_{j} in V} P(a_{1}, a_{2}, ..., a_{n} | v_{j}) * P(v_{j}) / P(a_{1}, a_{2}, ..., a_{n}) = argmax_{v_{j} in V} P(a_{1}, a_{2}, ..., a_{n} | v_{j}) * P(v_{j}), since the denominator does not depend on v_{j}.
5. With the assumption in (2) we get v_{max} = argmax_{v_{j} in V} P(v_{j}) * P(a_{1} | v_{j}) * P(a_{2} | v_{j}) * ... * P(a_{n} | v_{j}).
Naive Bayes Approach (Cont'd)
All that is needed are the values P(a_{i} | v_{j}) and P(v_{j}), which are estimated from the training examples. In the example:

P(PlayTennis = Yes)                 = 9/14 = 0.64
P(PlayTennis = No)                  = 5/14 = 0.36
P(Wind = Strong | PlayTennis = Yes) = 3/9  = 0.33
P(Wind = Strong | PlayTennis = No)  = 3/5  = 0.60
Assume we see a new instance:

Day | Outlook | Temperature | Humidity | Wind   | PlayTennis
----+---------+-------------+----------+--------+-----------
new | Sunny   | Cool        | High     | Strong | ???
We compute:
P(Yes) * P(Sunny|Yes) * P(Cool|Yes) * P(High|Yes) * P(Strong|Yes) = 0.0053
P(No) * P(Sunny|No) * P(Cool|No) * P(High|No) * P(Strong|No) = 0.0206
Thus Naive Bayes chooses the target value with the higher score, i.e., No in this case.
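These numbers can be reproduced from the training table (a small sketch; the frequency estimates are computed exactly as described above):

```python
# PlayTennis training examples D1-D14; attribute order:
# Outlook, Temperature, Humidity, Wind, and finally the target value.
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Rain", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

def score(label, attrs):
    """P(v_j) * product over i of P(a_i | v_j), estimated by frequencies."""
    rows = [r for r in data if r[-1] == label]
    p = len(rows) / len(data)
    for i, a in enumerate(attrs):
        p *= sum(1 for r in rows if r[i] == a) / len(rows)
    return p

new = ("Sunny", "Cool", "High", "Strong")
# score("Yes", new) ~ 0.0053 and score("No", new) ~ 0.0206, so classify as No.
```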
Example - Text Analysis
Summary
- Neural Networks and the Naive Bayes approach can be used in areas in which humans have no good intuition and do not know any causal relationships.
- The methods do not give any explanation for their answers.
- In some domains Naive Bayes has a performance comparable to decision tree learning and neural nets.
- Applied in many areas such as image recognition.
Further Reading
- George F. Luger. Artificial Intelligence - Structures and Strategies for Complex Problem Solving. 4th Edition, Addison Wesley, 2002, Chapter 10.
- Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997, Chapters 4 and 6.
- S. Russell, P. Norvig. Artificial Intelligence - A Modern Approach. 2nd Edition, Pearson Education, 2003, Chapter 20.
- Many more general AI texts and specialised books and articles.
- Specialised lecture on Neural Nets.
© Manfred Kerber, 2004, Introduction to AI
24.4.2005
The URL of this page is http://www.cs.bham.ac.uk/~mmk/Teaching/AI/Teaching/AI/l4.html.
URL of module http://www.cs.bham.ac.uk/~mmk/Teaching/AI/