Introduction to AI - Week 4 Neural Nets Naive Bayes * The human brain can be viewed (in a very simplified model) as a network between neurons * Simple probabilistic reasoning with strong assumptions can lead to powerful prediction _________________________________________________________________________________________________________ Neural Networks Neural Networks have the following features. * They are inspired by nature. The human mental capabilities are generated by the brain, which can be viewed as a large set of neurons (10^10 to 10^11 many), each connected to around 10^3 to 10^4 many others. * Unlike to expert systems or decision trees neural nets do not make use of an explicit symbolic representation. * Each individual neuron behaves in a very simple way. _________________________________________________________________________________________________________ A Neuron W1 001 000 X1 111111111111111111111111111111000 1 001 0 100 W2 1 0 000 X2 00000011111111100000000000001000000 001 0 100 W3 0 000 X3 00000011111111100000000000001000000 00 0 100 0 001 100001100001 W4 0 0 00 101 00 X4 00000011111111000000000000001000010 100 01 0 0 001 0 0 W5 0 00 0 11 X5 11111111111111111111111111111100010 SIGMA 100111 00010 THRESHOLD 0 1111 0000 01 0 001 001 0 0 001 0 100 0 0 W6 0 00 01 0 X6 11111111111111111111111111111000000 001 00 10 0 100 100 101 0 00 0000000000 0 001 * 0 100 * 0 00 * 0 001 0 100 0 00 0 000 WN 1 100 XN 1111111111111111111111111111 1000101 _________________________________________________________________________________________________________ A Neuron (Cont'd) * A neuron receives inputs from other neurons, x[1],x[2],... ,x[n], which are weighted by weights w[1],w[2],... ,w[n]. * In the simplest model, a neuron takes its weighted inputs and sums them up S=sum[i]^nw[i]* x[i]. * Then it checks whether the sum is above a given threshold S>= Threshold. * If yes, it sends out the value 1 (excitatory). * If not, it sends out the value -1 (inhibitory). _________________________________________________________________________________________________________ Building a Neural Net A neural net consists of * a set of input neurons, a set of output neurons, and a set of hidden neurons, * a set of directed links which connect two neurons each, * a set of weights (for each link one). In the simple case, which we will consider in the following, the set of input neurons, the set of output neurons, and the set of hidden neurons are given. Furthermore the topology is given, that is, the programmer has fixed initially which neuron is connected to which others. The weights have to be learned. _________________________________________________________________________________________________________ A Perceptron Let's assume we have a number i of input nodes and a single output node, a perceptron. We have to determine the weights such that the output is 1 if sum w[i]x[i] >= threshold -1 if sum w[i]x[i] < threshold We call this value sign(sum w[i]x[i]). When a particular setting of weights is given and we get a new training example (set of inputs) we do the following with a fixed so-called learning rate lambda. : * If the desired output and the actual output values are equal, do nothing. * If the actual output is -1, but should be 1, increase the weights, that is, w[i] -> w[i]+delta w[i] with delta w[i] = lambda* (d[i]-O[i])x[i]. * If the actual output is 1, but should be -1, decrease the weights, that is, w[i] -> w[i]+delta w[i] with delta w[i] = lambda* (d[i]-O[i])x[i]. _________________________________________________________________________________________________________ Example - Perceptron x[1] x[2] Output -------- ------------- ------------- 1.0 1.0 1 9.4 6.4 -1 2.5 2.1 1 8.0 7.7 -1 0.5 2.2 1 7.9 8.4 -1 7.0 7.0 -1 2.8 0.8 1 1.2 3.0 1 7.8 6.1 -1 W1 0 X1 1000011 111 11000000000100000 000 001 11 100 11 001 11 00 11 100 11 001 11 00 11 100 00001100001 11 001 00 01 11 00 0 0 W2 11 000 0 0 0 0 001 0 0 00 X2 00001111111111000000000100000 SIGMA 0111 00000 THRESHOLD 001111 0000 100 0 0 00 11 001 0 0 0 11 000 10 0 11 100 0 00 11 001 1001 000 11 000 10011 11 00 11 001 11 100 11 00 11 001 W3 11 100 1 00 BIAS 000000000011111111111 1000101 _________________________________________________________________________________________________________ Example - Perceptron (Cont'd) 10 010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000101 0 0 11 000 11 0 00 11 0 00 11 0001011 00 11 0 00 F(NET)=0 11 0 001 11 0 00 1 11 0 00 00001 11 0 00 11 0001011 101 11 0 00 11 0 100 1111 11 0 00 11111 11 x2 0 00 11 0 00 11 0001011 00 11 0 00 00001 11 0 00 11 0 00 11 0 00 11 5 0 00 10000 11 0 00 00000 11 0010001011 00 11 0 00 11 0 00 11 0 00 11 0 101 11 0 00 11 101001011 101 11 0 00 11 0 100 11 0 00 11 0 100 11 0 00 11 0001011 00 11 0 00 11 0 00 11 0 00 11 0 00 11 0 1 00 11 0001011 0 1 00 11 0 1 0 1 00 11 0 00 11 0 101 11 0 0 0 00 11 0 00000 1 0 1 101 11 0 1 0 00 11 0001011 101 11 0 00 11 0 101 11 0 00 11 0 00 11 0 0 00 11 0001011 00000 0 00 11 0 0 00000 00 11 0 0 00 11 0 00 11 0 00 11 0 0 0 1 0 0 1 0 10 0 01000000010010000000101000000001010000000010100000001010000000010100000001001000000010101110010101000000010 0 0 1 0 0 1 0 0 11 0 0 0 0 0 01 0 0 0 5 10 x1 We want to find a way to separate the positive from the negative examples. _________________________________________________________________________________________________________ Example - Perceptron (Cont'd) 1. Assume initial (guessed) weights as [0.75,0.5,-0.6]. 2. With the first value we get: f(net)^1=f(0.75* 1.0 + 0.5* 1.0- 0.6* 1) = 1, hence no correction 3. f(net)^2=f(0.75* 9.4 + 0.5* 6.4- 0.6* 1) = 1, however the result should be -1, so we have to apply the learning rule, i.e., w[i]-> w[i]+lambda*(d-sign(w[i]* x[i]))x[i], that is, w[1]-> 0.75 +0.2*(-1-1)* 9.4=-3.01, likewise w[2]=-2.06 and w[3]=-1.00. 4. This will be repeated for all input values and we will get the final weights satisfying 1.3* x[1] +1.1* x[2]-10.9=0 See [Luger, p. 424ff]. _________________________________________________________________________________________________________ Properties of a Perceptron * The choice of the learning rate lambda should not be too big (you will miss the true value), or too small (learning will be very slow). It is usual to decrease the learning rate slowly over time. * All that can be learned is a straight line to separate positive from negative examples. But what, if the positive and negative examples cannot be separated by a straight line? * Use more sophisticated threshold functions. M M M F(X) M F(X) M M M M M M M M M M 1 M SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS22: 1 M M M M M M M M M M M M M M M M M M XMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMSMSMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMW M M M M X M X M M M M M M M M M M M M M M -1 iaSSS22222222222222222222222222222222222222222222SS M -1 M M M M M M M M 0 0 0 f(x) 0 0 0 0 0 0 0 0 0 1 0 0 11111111111111111111111111111111111111111111111111 0 1111111111111111111111111111111111111111111111111111111 0 0 0 111100011100000111111001 0 10001010111 0 10000 0 000 0 001 0 000 0 00 0 001 0 000 0 000 0 1001 0 1000 0 1100000 0 1000011100000011000101011 0 0 100000000000000000000000000000000000000000000000000101000000000000000000000000000000000000000000000000000000000 0 0 0 0 x 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 hard limiting and bipolar threshold sigmoidal and unipolar threshold _________________________________________________________________________________________________________ Sigmoidal Thresholds w1 x1 00000110 010000000000000010001000 0 000 w2 0 001 x2 0000001110100000000000000100010 100 0 001 w3 0 00 x3 1000011111100000000000000100010 100 0 001 w4 0 00 0000000001 x4 0000000110100000000000000100010 100 01 101 0 001 10 1 10 w5 0 100 0 111111111111 01 1 x5 1000010010100000000000000100010 SIGMA 0010000000000010000 0 1 0000 0 11 10 0 00 01 0 00000 01000100000 0 0 001 11 0000 1 0 000 w6 0 100 0 11111111111 0 x6 1000011111100000000000000100010 00 01 1 0 0 001 100 101 0 100 0000000 * 0 001 * 0 001 * 0 100 0 001 0 000 wn 0 00 xn 1111111111111111111111111 000101 * First the weighted inputs are summed up: net=sum[i=1]^nw[i]* x[i]. * Second the threshold function is applied: output=sigma(net)=1/ 1+ e^-lambda* net _________________________________________________________________________________________________________ Why Sigmoidal Thresholds? * Nice theoretical properties (differentiable) * Can approximate hard-limiting threshold function well for large lambda. For learning (training) a network, use backpropagation. Try to minimise the squared error Error = ½sum[i](d[i]-O[i])^2 _________________________________________________________________________________________________________ Multi Layered Neural Networks In order to overcome the limitations of the perceptron, create more complicated networks. * The perceptron can be viewed as a network of two layers, the input layer and the output layer. * Any node, which is neither an input nor an output node, is called a hidden node. * It is possible to create networks with many hidden nodes. * A standard approach is to have a layered approach, in which all nodes are partitioned in layers and each layer n feeds in the next layer n+1 only. _________________________________________________________________________________________________________ Three Layered Neural Networks A simple, but already very powerful topology of nets is the three layered topology, in which there is one input layer, one hidden layer, and one output layer. O1 O2 O3 .... Oj .... On OUTPUT LAYER 001 1 000 0 0000 11000 0 1111100 0 1 0 0 100 1100 010 000000 00 01 1110 111001 0 0 0001111 11100 100 00 01 00 11 1 00 10 00 0 00 111 0001 000 0 1 00 111 0 0 0 01 000 0 000 101000 001001 0001 0 11100 00 0 0 00 1011 0 01 10 100000 10 111011 100 0 W 0 110 010 01 00 01001 111 10101 111 11101 10 101 0 0 0 001 1 1000 0 10011 0010001 0000 1 0 1000 00 0 0 0 01 01 0 10 1000 0 101 1 10000 00 100 100 1000 00 0 0 1111 111100010 10 0000 1 111 1001 11000 1 0000100111 1111 00 0 1 00 1000111 1100111 101100101 100000 101000111 101 H1 H2 H3 .... Hi Hm HIDDEN LAYER 0010 1 101 01 11 0 000 1 10100 0 0 0 00 000000 101 100 01100000 0 0111011 111 10 0 0 1 111000000 00 001 0 011000 00011 00 1 000 011 00 0 0 01 01 1 0 1 000 011 0000 01 11 0 1 100 101 0 0 01 000 10101 10 110 11 001 101 1001 1 1101 01 101 0 0 100 101 000 1 010001 000 0000 1001 100 11 0 00 10010 0 100 1 10000 0010 00010 101 00 10 0 0 0 010 1 100 10 000101 001 0000 1111111000 00 0 0 0 100 1 10 1 001 1 111 00 1001 0110010 0011001 00 0 0 0 10 11 01 01 001110110 11 10001101 1 0 1000 110001 0 0 0 010 111001 0 11111 00 011000 1100111 0 01 100001 100000 0 100 100011 100111 00011 10001 100111 00 I1 I2 I3 .... IL IK INPUT LAYER _________________________________________________________________________________________________________ Three Layered Neural Networks (Cont'd) A typical topology of a three-layered neural network is: * All input nodes are connected with all nodes in the hidden layer. * All nodes in the hidden layer are connected with the output layer. * No further nodes are connected. Since the number of input and output nodes is often determined by the problem to be solved, the only decision a designer has to make is to determine how many hidden nodes the net should have. The weights have then to be determined in a so-called training phase of the net. _________________________________________________________________________________________________________ Backpropagation Z Z MM M. M M MM MM BMr WM 2MMS MMM MMM MM MMMM MMMM @MMM7 SMMM M M MMMM SX M ..,:,,,ii:iiiii::,,,,. M 7 .ii: , ,,::: M ,, M :,i::. .::i,: M ,:i, iM :i. i: .: :, :. OUTPUT LAYER i ; OUTPUT LAYER ; :. OUTPUT LAYER : :, S ii 7 XM. ,:i. .i, MM M2 ::i:. .:i,:. .M M@ :::iii:, .::i::,i. M Mi WW :::,,:,,,:iiiiii:,,,, i: M M MM M @ B W @a M MM M, M M M; M M MMMM MM MM 0M; MM M M aZZa MMM MMM MMM MMM ,M M :: 7MMMM MMMM; MMMM; MMMM M. MS aa M W B W M B BB M M M M 7B MMM. ii 8 Z M MMMMMMB BM.M , M :,,::ii::::::::::::::...:i:::, M MMMMM XM M MM Mi MM. .:,ii::. :ii::: BMXi rM @ @M M8 M:M :,i,. .::B ::, ::, BACK ERROR FORWARD NETWORK ACTIVATION i, :, PROPAGATION FORWARD NETWORK ACTIVATION ,: HIDDEN LAYER .i ; HIDDEN LAYER : ; HIDDEN LAYER ; BACK ERROR FORWARD NETWORK ACTIVATION i. ; PROPAGATION FORWARD NETWORK ACTIVATION .i :: .:, .. ,i:: 7: .:, 2 ,:i:i: .i::,.M0 rM; .,:i,:,:ii:.... .....ii:::, .. aM M0 M ......,,. M M 0M rM MZ iM MX MM: WB W MM M M MMM MM M MMM rB M MMMMr MM MM MMMM B M M MMMM MMM M 8Z M M ;; MMMM M M M M ;; M M iM Mi M aa M M ;M :Xa M aa M M MMMB .MMM7 M WB Z M MMMZ MMMa M . .ii::::,:::,,,,,,,::::::,iii:, M MM. MM .:,::i,. :i::,7, Z ::i:. ,:, ,:: i,. ,i .:. :. .: :, INPUT LAYER i :. INPUT LAYER : :. INPUT LAYER .i ,: i ,:, ::, ::i, .:::: .::iii, ::ii::. ....,,,.,...............,,..,,,..., Since we have multiple outputs, we want to minimise Error = ½sum[i in T]sum[k in Outputs](d[i,k]-O[i,k])^2 with: * T is the set of training examples (given in form of an input list and an output list). * d desired output * O actual output _________________________________________________________________________________________________________ Backpropagation Approach 1. Initialise the weights randomly with small numbers (e.g. between -0.05 and 0.05.) 2. Until termination condition reached, repeat for each training example x: 1. Take a training example x and compute the corresponding outputs. 2. For each output node k calculate its error term delta[k]=lambda* O[k]* (1-O[k]) * (d[k]-O[k]) 3. For each hidden node h calculate its error term delta[h]=O[h]* (1-O[h]) * sum[k in Outputs] w[k,h]delta[k] 4. Update weights w[j,i] -> w[j,i]+delta w[j,i] with delta w[j,i] = delta[j]* x[j,i] _________________________________________________________________________________________________________ Intuitive Explanation for the Updates * [(b)] delta[k]=lambda* O[k]* (1-O[k]) * (d[k]-O[k]) The change of weights to the output nodes should be proportional to the difference between desired and actual output d[k]-O[k], multiplied with the derivative of the sigmoid function. Note, with sigma(x)=1/ 1+ e^-lambda* x you get sigma'(x)=--lambda e^-lambda* x / (1+e^-lambda * x)^2, that is, sigma'(x) = lambda...igma(x)* (1-sigma(x)) * [(c)] delta[h]=O[h]* (1-O[h]) * sum[k in Outputs] w[k,h]delta[k] The value for a hidden node is updated similarly. However, since there is no error value which directly tells how to update it, the weighted average of its contribution to the errors is taken, where w[k,h] is the weight between the hidden node h and the output node k. _________________________________________________________________________________________________________ Naive Bayes Approach Let a list of attribute values [a[1],a[2],... ,a[n]] be given, e.g., [Outlook=sunny, Humidity=high, Wind=strong] . The task is to predict a target concept PlayTennis in form of a yes/no decision. Day Outlook Temperature Humidity Wind PlayTennis -------- ------------- ------------- D1 Sunny Hot High Weak No D2 Sunny Hot High Strong No D3 Overcast Hot High Weak Yes D4 Rain Mild High Weak Yes D5 Rain Cool Normal Weak Yes D6 Rain Cool Normal Strong No D7 Overcast Cool Normal Strong Yes D8 Sunny Mild High Weak No D9 Sunny Cool Normal Weak Yes D10 Rain Mild Normal Weak Yes D11 Sunny Mild Normal Strong Yes D12 Overcast Mild High Strong Yes D13 Rain Hot Normal Weak Yes D14 Rain Mild High Strong No _________________________________________________________________________________________________________ Naive Bayes Approach (Cont'd) 1. Approach: Estimate probabilities of P(v[j]) and P(a[i]|v[j]) from the frequencies in the training set. 2. Naively assume that the probabilities are all independent, i.e., we can compute P(a[1],a[2],... ,a[n]| v[j]) as the product P(a[1]|v[j])* P(a[2]|v[j])*...* P(a[n]|v[j]). 3. The classifier is the most probable target value, that is, choose v[j] so that given the attribute values [a[1],a[2],... ,a[n]], take v[max]=argmax[v[j] in V]P(v[j|a[]]1,a[2],... ,a[n]) 4. With Bayes' theorem we get v[max]= argmax[v[j] in V]P(a[1],a[2],... ,a[n]|v[j])P(v[j])/ P(a[1],a[2],... ,a[n])= argmax[v[j] in V]P(a[1],a[2],... ,a[n]|v[j])P(v[j]) 5. With the assumption in (2) we get v[max]= argmax[v[j] in V] P(v[j])* P(a[1]|v[j])* P(a[2]|v[j])*...* P(a[n]|v[j]) _________________________________________________________________________________________________________ Naive Bayes Approach (Cont'd) All that is needed are the values P(a[i]|v[j]) and P(v[j]), which are estimated from the training examples. In the example: P(PlayTennis = Yes) = 9/14 = 0.64 P(PlayTennis = No) = 5/14 = 0.36 P(Wind = strong | PlayTennis = Yes) = 3/9 = 0.33 P(Wind = strong | PlayTennis = No) = 3/5 = 0.60 Assumed we see a new instance Day Outlook Temperature Humidity Wind PlayTennis -------- ------------- ------------- new Sunny Cool High Strong ??? We compute: P(Yes)* P(Sunny|Yes)* P(Cool|Yes)* P(High|Yes)* P(Strong|Yes)= 0.0053 P(No)* P(Sunny|No)* P(Cool|No)* P(High|No)* P(Strong|No)= 0.0206 Thus Naive Bayes chooses the attribute with the higher probability, i.e., No in this case. _________________________________________________________________________________________________________ Example - Text Analysis * Assume we want to learn the target concept C, e.g. C= "www page that discuss the topic `Artificial Intelligence'." * Assume we have k examples, e.g., pages of which we know for each whether it is relevant or not. * We collect all n different words in all pages, e.g., n=1,000,000 words, w[1],w[2],... ,w[n] and estimate their probabilities (from the training data): P(AIRelevant = Yes), P(AIRelevant = No), P(w[i] = Yes | AIRelevant = Yes), P(w[i] = No | AIRelevant = Yes), P(w[i] = Yes | AIRelevant = No), P(w[i] = No | AIRelevant = No), ... . Any unseen article will be classified according to P(Yes)* P(w[1]=val[1]|Yes)... P(w[n]=val[n]|Yes) (relevant) P(No)* P(w[1]=val[1]|No)... P(w[n]=val[n]|No) (irrelevant) Classify according to greater probability. _________________________________________________________________________________________________________ Summary * Neural Networks and the Naive Bayes approach can be used in areas in which humans have no good intuition and do not know any causal relationship. * The methods do not give any explanation for their answers. * In some domains Naive Bayes has a comparable performance to decision tree learning and neural nets. * Applied in many areas such as recognition of pictures _________________________________________________________________________________________________________ Further Reading books-shelf1.jpg] * George F. Luger. Artificial Intelligence - Structures and Strategies for Complex Problem Solving. 4th Edition, Addison Wesley, 2002, Chapter 10. * Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997, Chapters 4 and 6. * S. Russell, P. Norvig. Artificial Intelligence - A Modern Approach. 2nd Edition, Pearson Education, 2003. Chapter 20. * Many more general AI texts and specialised books and articles. * Specialised lecture on Neural Nets. _________________________________________________________________________________________________________ _________________________________________________________________________________________________________ © Manfred Kerber, 2004, Introduction to AI 22.12.2004 The URL of this page is http://www.cs.bham.ac.uk/~mmk/Teaching/AI/Teaching/AI/l4.html. URL of module [1]http://www.cs.bham.ac.uk/~mmk/Teaching/AI/ References 1. http://www.cs.bham.ac.uk/~mmk/Teaching/AI/