Introduction to AI - Week 4
Neural Nets
Naive Bayes
* The human brain can be viewed (in a very simplified model) as a network between neurons
* Simple probabilistic reasoning with strong assumptions can lead to powerful prediction
_________________________________________________________________________________________________________
Neural Networks
Neural Networks have the following features.
* They are inspired by nature. Human mental capabilities are generated by the brain, which can be viewed
as a large set of neurons (10^10 to 10^11 of them), each connected to around 10^3 to 10^4 others.
* Unlike expert systems or decision trees, neural nets do not make use of an explicit symbolic
representation.
* Each individual neuron behaves in a very simple way.
_________________________________________________________________________________________________________
A Neuron
[neuron.jpg]
_________________________________________________________________________________________________________
A Neuron (Cont'd)
* A neuron receives inputs from other neurons, x[1],x[2],... ,x[n], which are weighted by weights
w[1],w[2],... ,w[n].
* In the simplest model, a neuron takes its weighted inputs and sums them up:
S = sum[i=1]^n w[i]*x[i].
* Then it checks whether the sum is above a given threshold
S>= Threshold.
* If yes, it sends out the value 1 (excitatory).
* If not, it sends out the value -1 (inhibitory).
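This simple neuron model can be sketched in a few lines of Python (an illustration only; the weights, inputs, and threshold are made-up numbers):

```python
def neuron(inputs, weights, threshold):
    """Simplest neuron model: weighted sum of inputs, then hard threshold."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s >= threshold else -1  # 1 = excitatory, -1 = inhibitory

# Two inputs, threshold 0.5: 0.6*1.0 + 0.2*0.3 = 0.66 >= 0.5, so it fires
print(neuron([1.0, 0.3], [0.6, 0.2], 0.5))   # 1
```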
_________________________________________________________________________________________________________
Building a Neural Net
A neural net consists of
* a set of input neurons, a set of output neurons, and a set of hidden neurons,
* a set of directed links which connect two neurons each,
* a set of weights (one for each link).
In the simple case, which we will consider in the following, the set of input neurons, the set of output
neurons, and the set of hidden neurons are given. Furthermore, the topology is given, that is, the programmer
has fixed in advance which neuron is connected to which others.
The weights have to be learned.
_________________________________________________________________________________________________________
A Perceptron
Let's assume we have n input nodes and a single output node; such a unit is called a perceptron. We have to
determine the weights such that the output is
1 if sum[i] w[i]*x[i] >= threshold
-1 if sum[i] w[i]*x[i] < threshold
We call this value sign(sum[i] w[i]*x[i]). When a particular setting of weights is given and we get a new
training example (a set of inputs together with the desired output), we do the following with a fixed
so-called learning rate lambda:
* If the desired output d and the actual output O are equal, do nothing.
* If the actual output is -1, but should be 1, increase the weights, that is, w[i] -> w[i]+delta w[i] with
delta w[i] = lambda*(d-O)*x[i].
* If the actual output is 1, but should be -1, decrease the weights using the same rule: since d-O is now
negative, delta w[i] = lambda*(d-O)*x[i] shrinks the weights.
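The update rule above can be sketched as follows (a minimal illustration; function names and the default learning rate are my own choices):

```python
def sign(s):
    """Hard-limiting threshold with threshold 0."""
    return 1 if s >= 0 else -1

def perceptron_update(weights, x, desired, lam=0.1):
    """One step of the perceptron learning rule:
    w[i] -> w[i] + lambda * (d - O) * x[i].
    If desired == actual, (d - O) = 0 and the weights stay unchanged."""
    actual = sign(sum(w * xi for w, xi in zip(weights, x)))
    return [w + lam * (desired - actual) * xi
            for w, xi in zip(weights, x)]
```

Note that a single rule covers all three cases: (d - O) is 0 when the output is correct, +2 when the weights must grow, and -2 when they must shrink.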
_________________________________________________________________________________________________________
Example - Perceptron
x[1] x[2] Output
-------- ------------- -------------
1.0 1.0 1
9.4 6.4 -1
2.5 2.1 1
8.0 7.7 -1
0.5 2.2 1
7.9 8.4 -1
7.0 7.0 -1
2.8 0.8 1
1.2 3.0 1
7.8 6.1 -1
[ex-perceptron1.jpg]
_________________________________________________________________________________________________________
Example - Perceptron (Cont'd)
[ex-perceptron.jpg]
We want to find a way to separate the positive from the negative examples.
_________________________________________________________________________________________________________
Example - Perceptron (Cont'd)
1. Assume initial (guessed) weights as [0.75,0.5,-0.6].
2. With the first example we get:
f(net[1]) = f(0.75*1.0 + 0.5*1.0 - 0.6*1) = 1, hence no correction.
3. f(net[2]) = f(0.75*9.4 + 0.5*6.4 - 0.6*1) = 1; however, the result should be -1, so we have to apply the
learning rule, i.e., w[i] -> w[i] + lambda*(d - sign(sum[j] w[j]*x[j]))*x[i]. With lambda = 0.2 this gives
w[1] -> 0.75 + 0.2*(-1-1)*9.4 = -3.01, likewise w[2] = -2.06 and w[3] = -1.00.
4. This is repeated for all input values, and we get final weights satisfying 1.3*x[1] + 1.1*x[2] - 10.9 = 0.
See [Luger, p. 424ff].
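The correction in step 3 can be checked numerically (lambda = 0.2, with x[3] = 1 as the constant bias input):

```python
lam = 0.2
w = [0.75, 0.5, -0.6]
x = [9.4, 6.4, 1.0]        # second training example, bias input 1
d = -1                     # desired output
net = sum(wi * xi for wi, xi in zip(w, x))   # 0.75*9.4 + 0.5*6.4 - 0.6 = 9.65
actual = 1 if net >= 0 else -1               # fires 1, but should be -1
w = [wi + lam * (d - actual) * xi for wi, xi in zip(w, x)]
print([round(wi, 2) for wi in w])            # [-3.01, -2.06, -1.0]
```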
_________________________________________________________________________________________________________
Properties of a Perceptron
* The learning rate lambda should be neither too big (you may overshoot the true value) nor too small
(learning will be very slow). It is usual to decrease the learning rate slowly over time.
* All that can be learned is a straight line separating positive from negative examples. But what if the
positive and negative examples cannot be separated by a straight line?
* Use more sophisticated threshold functions.
[bipolar-threshold.jpg] [sigmoidal-threshold.jpg]
hard limiting and bipolar threshold sigmoidal and unipolar threshold
_________________________________________________________________________________________________________
Sigmoidal Thresholds
[sigmoid.jpg]
* First the weighted inputs are summed up: net = sum[i=1]^n w[i]*x[i].
* Second the threshold function is applied:
output = sigma(net) = 1/(1 + e^(-lambda*net))
_________________________________________________________________________________________________________
Why Sigmoidal Thresholds?
* Nice theoretical properties (differentiable)
* Can approximate hard-limiting threshold function well for large lambda.
For learning (training) a network, use backpropagation. Try to minimise the squared error
Error = ½ sum[i] (d[i]-O[i])^2
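Both the sigmoid and the squared error are easy to sketch in Python (illustrative only; the sample values are made up):

```python
import math

def sigmoid(net, lam=1.0):
    """Sigmoidal threshold: sigma(net) = 1 / (1 + e^(-lambda*net))."""
    return 1.0 / (1.0 + math.exp(-lam * net))

def squared_error(desired, actual):
    """Error = 1/2 * sum_i (d_i - O_i)^2."""
    return 0.5 * sum((d - o) ** 2 for d, o in zip(desired, actual))

# For large lambda the sigmoid approximates the hard-limiting threshold:
print(round(sigmoid(0.1, lam=100), 4))    # 1.0
print(round(sigmoid(-0.1, lam=100), 4))   # 0.0
```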
_________________________________________________________________________________________________________
Multi Layered Neural Networks
In order to overcome the limitations of the perceptron, create more complicated networks.
* The perceptron can be viewed as a network of two layers, the input layer and the output layer.
* Any node, which is neither an input nor an output node, is called a hidden node.
* It is possible to create networks with many hidden nodes.
* A standard approach is a layered one, in which all nodes are partitioned into layers and each layer n
feeds only into the next layer n+1.
_________________________________________________________________________________________________________
Three Layered Neural Networks
A simple, but already very powerful topology of nets is the three layered topology, in which there is one
input layer, one hidden layer, and one output layer.
[three-layer.jpg]
_________________________________________________________________________________________________________
Three Layered Neural Networks (Cont'd)
A typical topology of a three-layered neural network is:
* All input nodes are connected with all nodes in the hidden layer.
* All nodes in the hidden layer are connected with the output layer.
* No further nodes are connected.
Since the number of input and output nodes is often determined by the problem to be solved, the only decision
a designer has to make is to determine how many hidden nodes the net should have. The weights have then to be
determined in a so-called training phase of the net.
_________________________________________________________________________________________________________
Backpropagation
[backpropagation.jpg] Since we have multiple outputs, we want to minimise
Error = ½ sum[i in T] sum[k in Outputs] (d[i,k]-O[i,k])^2 with:
* T is the set of training examples (given in form of an input list and an output list).
* d desired output
* O actual output
_________________________________________________________________________________________________________
Backpropagation Approach
1. Initialise the weights randomly with small numbers (e.g. between -0.05 and 0.05).
2. Until termination condition reached, repeat for each training example x:
1. Take a training example x and compute the corresponding outputs.
2. For each output node k calculate its error term
delta[k]=lambda* O[k]* (1-O[k]) * (d[k]-O[k])
3. For each hidden node h calculate its error term
delta[h]=O[h]* (1-O[h]) * sum[k in Outputs] w[k,h]delta[k]
4. Update weights w[j,i] -> w[j,i]+delta w[j,i] with
delta w[j,i] = delta[j]* x[j,i]
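The steps above can be sketched as one training step for a three-layer net (a minimal illustration, assuming lambda = 1 so the sigmoid derivative reduces to O*(1-O), and with a learning rate added to the weight update; the network sizes and data are made up):

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def backprop_step(w_ih, w_ho, x, d, rate=0.5):
    """One backpropagation step for a three-layer net (lambda = 1).
    w_ih[h][i]: weight from input i to hidden node h,
    w_ho[k][h]: weight from hidden node h to output node k."""
    # 1. Forward pass: compute hidden and output activations
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_ih]
    out = [sigmoid(sum(w * h for w, h in zip(ws, hidden))) for ws in w_ho]
    # 2. Error terms for output nodes: delta_k = O_k(1-O_k)(d_k-O_k)
    delta_o = [o * (1 - o) * (dk - o) for o, dk in zip(out, d)]
    # 3. Error terms for hidden nodes:
    #    delta_h = O_h(1-O_h) * sum_k w_kh * delta_k  (uses the old weights)
    delta_h = [h * (1 - h) * sum(w_ho[k][j] * delta_o[k]
                                 for k in range(len(out)))
               for j, h in enumerate(hidden)]
    # 4. Weight updates: delta w_ji = rate * delta_j * x_ji
    w_ho = [[w + rate * dk * h for w, h in zip(ws, hidden)]
            for ws, dk in zip(w_ho, delta_o)]
    w_ih = [[w + rate * dh * xi for w, xi in zip(ws, x)]
            for ws, dh in zip(w_ih, delta_h)]
    return w_ih, w_ho
```

The hidden error terms must be computed with the old hidden-to-output weights before those weights are updated, which is why step 3 precedes step 4 in the code.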
_________________________________________________________________________________________________________
Intuitive Explanation for the Updates
* [(b)] delta[k]=lambda* O[k]* (1-O[k]) * (d[k]-O[k])
The change of weights to the output nodes should be proportional to the difference between desired and
actual output d[k]-O[k], multiplied with the derivative of the sigmoid function.
Note, with sigma(x) = 1/(1 + e^(-lambda*x)) you get sigma'(x) = lambda*e^(-lambda*x) / (1 + e^(-lambda*x))^2,
that is, sigma'(x) = lambda*sigma(x)*(1-sigma(x)).
* [(c)] delta[h]=O[h]* (1-O[h]) * sum[k in Outputs] w[k,h]delta[k]
The value for a hidden node is updated similarly. However, since there is no error value which directly
tells how to update it, the weighted average of its contribution to the errors is taken, where w[k,h] is
the weight between the hidden node h and the output node k.
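The derivative identity sigma'(x) = lambda*sigma(x)*(1-sigma(x)) can be confirmed numerically (a quick sketch; the chosen lambda and sample point are arbitrary):

```python
import math

def sigma(x, lam=2.0):
    return 1.0 / (1.0 + math.exp(-lam * x))

# Central finite difference vs. lambda * sigma(x) * (1 - sigma(x))
x, h, lam = 0.7, 1e-6, 2.0
numeric = (sigma(x + h) - sigma(x - h)) / (2 * h)
analytic = lam * sigma(x) * (1 - sigma(x))
print(abs(numeric - analytic) < 1e-6)   # True
```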
_________________________________________________________________________________________________________
Naive Bayes Approach
Let a list of attribute values [a[1],a[2],... ,a[n]] be given, e.g., [Outlook=sunny, Humidity=high,
Wind=strong] . The task is to predict a target concept PlayTennis in form of a yes/no decision.
Day Outlook Temperature Humidity Wind PlayTennis
----- ---------- ------------- ---------- -------- ------------
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Rain Hot Normal Weak Yes
D14 Rain Mild High Strong No
_________________________________________________________________________________________________________
Naive Bayes Approach (Cont'd)
1. Approach: estimate the probabilities P(v[j]) and P(a[i]|v[j]) from the frequencies in the training set.
2. Naively assume that the attribute values are conditionally independent given the target value, i.e., we
can compute P(a[1],a[2],... ,a[n]|v[j]) as the product P(a[1]|v[j])* P(a[2]|v[j])*...* P(a[n]|v[j]).
3. The classifier returns the most probable target value, that is, given the attribute values
[a[1],a[2],... ,a[n]], take v[max] = argmax[v[j] in V] P(v[j]|a[1],a[2],... ,a[n])
4. With Bayes' theorem we get
v[max]= argmax[v[j] in V]P(a[1],a[2],... ,a[n]|v[j])P(v[j])/ P(a[1],a[2],... ,a[n])= argmax[v[j] in
V]P(a[1],a[2],... ,a[n]|v[j])P(v[j])
5. With the assumption in (2) we get
v[max]= argmax[v[j] in V] P(v[j])* P(a[1]|v[j])* P(a[2]|v[j])*...* P(a[n]|v[j])
_________________________________________________________________________________________________________
Naive Bayes Approach (Cont'd)
All that is needed are the values P(a[i]|v[j]) and P(v[j]), which are estimated from the training examples. In
the example:
P(PlayTennis = Yes) = 9/14 = 0.64
P(PlayTennis = No) = 5/14 = 0.36
P(Wind = strong | PlayTennis = Yes) = 3/9 = 0.33
P(Wind = strong | PlayTennis = No) = 3/5 = 0.60
Assume we see a new instance
Day Outlook Temperature Humidity Wind PlayTennis
----- ---------- ------------- ---------- -------- ------------
new Sunny Cool High Strong ???
We compute:
P(Yes)* P(Sunny|Yes)* P(Cool|Yes)* P(High|Yes)* P(Strong|Yes)= 0.0053
P(No)* P(Sunny|No)* P(Cool|No)* P(High|No)* P(Strong|No)= 0.0206
Thus Naive Bayes chooses the target value with the higher score, i.e., No in this case.
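The whole computation can be reproduced from the training table (a sketch; the function name is my own, and the frequency estimates are taken directly from the 14 examples above):

```python
# Training data: (Outlook, Temperature, Humidity, Wind, PlayTennis)
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Rain", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

def score(target, attrs):
    """P(v) * product_i P(a_i | v), estimated from frequencies."""
    rows = [r for r in data if r[-1] == target]
    p = len(rows) / len(data)
    for i, a in enumerate(attrs):
        p *= sum(1 for r in rows if r[i] == a) / len(rows)
    return p

new = ("Sunny", "Cool", "High", "Strong")
print(round(score("Yes", new), 4))   # 0.0053
print(round(score("No", new), 4))    # 0.0206
```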
_________________________________________________________________________________________________________
Example - Text Analysis
* Assume we want to learn the target concept C, e.g., C = "web pages that discuss the topic `Artificial
Intelligence'".
* Assume we have k examples, e.g., pages of which we know for each whether it is relevant or not.
* We collect all n different words occurring in the pages, e.g., n = 1,000,000 words w[1],w[2],... ,w[n],
and estimate their probabilities from the training data: P(AIRelevant = Yes), P(AIRelevant = No), and for
each word w[i] the probabilities P(w[i] = Yes | AIRelevant = Yes), P(w[i] = No | AIRelevant = Yes),
P(w[i] = Yes | AIRelevant = No), P(w[i] = No | AIRelevant = No), where w[i] = Yes means that the word w[i]
occurs in the page.
Any unseen article will be classified according to
P(Yes)* P(w[1]=val[1]|Yes)... P(w[n]=val[n]|Yes) (relevant)
P(No)* P(w[1]=val[1]|No)... P(w[n]=val[n]|No) (irrelevant)
Classify according to greater probability.
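One practical remark (not in the lecture): with n on the order of a million word attributes, the product of that many probabilities underflows floating-point arithmetic, so implementations usually compare sums of log-probabilities instead. A sketch with made-up probability values:

```python
import math

def log_score(prior, cond_probs):
    """Compare log P(v) + sum_i log P(w_i = val_i | v) instead of the raw
    product, which would underflow to 0.0 for large n."""
    return math.log(prior) + sum(math.log(p) for p in cond_probs)

# Made-up probabilities for three words in a new page:
rel = log_score(0.2, [0.01, 0.3, 0.05])
irrel = log_score(0.8, [0.001, 0.2, 0.04])
print("relevant" if rel > irrel else "irrelevant")   # relevant
```

Since log is monotonic, the argmax over target values is unchanged.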
_________________________________________________________________________________________________________
Summary
* Neural Networks and the Naive Bayes approach can be used in areas in which humans have no good intuition
and do not know any causal relationship.
* The methods do not give any explanation for their answers.
* In some domains Naive Bayes has a comparable performance to decision tree learning and neural nets.
* Applied in many areas such as image recognition.
_________________________________________________________________________________________________________
Further Reading [books-shelf1.jpg]
* George F. Luger. Artificial Intelligence - Structures and Strategies for Complex Problem Solving. 4th
Edition, Addison Wesley, 2002, Chapter 10.
* Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997, Chapters 4 and 6.
* S. Russell, P. Norvig. Artificial Intelligence - A Modern Approach. 2nd Edition, Pearson Education, 2003.
Chapter 20.
* Many more general AI texts and specialised books and articles.
* Specialised lecture on Neural Nets.
_________________________________________________________________________________________________________
© Manfred Kerber, 2004, Introduction to AI
24.4.2005
The URL of this page is http://www.cs.bham.ac.uk/~mmk/Teaching/AI/Teaching/AI/l4.html.
URL of module [1]http://www.cs.bham.ac.uk/~mmk/Teaching/AI/
References
1. http://www.cs.bham.ac.uk/~mmk/Teaching/AI/