Introduction to AI  Week 3
Learning
Decision Trees
Reinforcement Learning
Inductive learning involves making uncertain inferences that go
beyond our direct experience. [Anderson 95]
Scientific discovery by proof is impossible, unless
one knows the "first" primary premises ... We must obtain these
premises by induction. [Aristotle, ~330 BC]
Decision Trees
A decision tree takes as input an object described by a set of
properties and outputs a Boolean value (a yes/no decision). Each internal
node in the tree corresponds to a test of one of the properties;
branches are labelled with the possible values of the test.
Aim: learn a goal concept (goal predicate) from examples.
Learning element: the algorithm that builds up the decision tree.
Performance element: the decision procedure given by the tree.
Example
Problem: whether to wait for a table at a restaurant. A decision
tree decides whether to wait or not in a given situation. Attributes:
Alternate: is there an alternative restaurant nearby?
Bar: is there a bar area to wait in?
Fri/Sat: true on Fridays and Saturdays
Hungry: whether we are hungry
Patrons: how many people are in the restaurant (None, Some, or Full)
Price: price range (£, ££, £££)
Raining: is it raining outside?
Reservation: whether we made a reservation
Type: kind of restaurant (French, Italian, Thai, or Burger)
WaitEstimate: estimated wait in minutes (<10, 10-30, 30-60, >60)
Original Decision Tree
Expressiveness of Decision Trees
Each path through the tree corresponds to an implication like
FORALL r: Patrons(r, Full) & WaitEstimate(r, 0-10) & Hungry(r, N) => WillWait(r)
Hence a decision tree corresponds to a conjunction of implications.
It cannot express tests that refer to two different objects, like:
EXISTS r_2: Nearby(r_2) & Price(r, p) & Price(r_2, p_2) & Cheaper(p_2, p)
The expressiveness is essentially that of propositional logic (no function
symbols, no existential quantifier).
The number of distinct Boolean functions for n attributes is
2^{2^n}, since for each function 2^n values have to be
defined (e.g. for n = 6 there are about 2 x 10^{19} different functions).
Functions like the parity function (1 for an even number of 1-inputs, 0
otherwise) or the majority function (1 if more than half of the inputs
are 1) lead to exponentially large decision trees.
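As a quick sanity check of the count above, a one-line Python sketch (the function name is ours): each function must fix one of two output values for each of the 2^n rows of its truth table, giving 2^(2^n) functions.

```python
def num_boolean_functions(n):
    """Number of distinct Boolean functions over n Boolean attributes:
    one of 2 output values for each of the 2^n truth-table rows,
    hence 2^(2^n) functions in total."""
    return 2 ** (2 ** n)

print(num_boolean_functions(6))  # 18446744073709551616, about 1.8 * 10^19
```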
Some Examples
Ex  | Alt | Bar | Fri | Hun | Pat  | Price | Rain | Res | Type    | Est   | Wait (Goal)
----|-----|-----|-----|-----|------|-------|------|-----|---------|-------|------------
X1  | Yes | No  | No  | Yes | Some | £££   | No   | Yes | French  | 0-10  | Yes
X2  | Yes | No  | No  | Yes | Full | £     | No   | No  | Thai    | 30-60 | No
X3  | No  | Yes | No  | No  | Some | £     | No   | No  | Burger  | 0-10  | Yes
X4  | Yes | No  | Yes | Yes | Full | £     | Yes  | No  | Thai    | 10-30 | Yes
X5  | Yes | No  | Yes | No  | Full | £££   | No   | Yes | French  | >60   | No
X6  | No  | Yes | No  | Yes | Some | ££    | Yes  | Yes | Italian | 0-10  | Yes
X7  | No  | Yes | No  | No  | None | £     | Yes  | No  | Burger  | 0-10  | No
X8  | No  | No  | No  | Yes | Some | ££    | Yes  | Yes | Thai    | 0-10  | Yes
X9  | No  | Yes | Yes | No  | Full | £     | Yes  | No  | Burger  | >60   | No
X10 | Yes | Yes | Yes | Yes | Full | £££   | No   | Yes | Italian | 10-30 | No
X11 | No  | No  | No  | No  | None | £     | No   | No  | Thai    | 0-10  | No
X12 | Yes | Yes | Yes | Yes | Full | £     | No   | No  | Burger  | 30-60 | Yes
Different Solutions
Trivial solution: construct a decision tree that has one path to a leaf for each example.
This classifies the given examples correctly, but generalises badly to new ones.
Occam's razor: the most likely hypothesis is the simplest
one that is consistent with all observations.
Finding the smallest decision tree is intractable, hence heuristic decisions: test the most important attribute
first, where "most important" means "makes the most difference to the
classification of an example".
This yields short paths in the tree, i.e. small trees.
Compare splitting the examples by testing on different attributes (cf. Patrons vs. Type).
Selecting Best Attributes
Selecting Best Attributes (Cont'd)
Recursive Algorithm
If there are positive and negative examples, choose the "best"
attribute to split on (i.e. Patrons in the example above).
If all remaining examples are positive (or all negative), then done.
If no examples are left, there is no information, hence return a default value.
If no attributes are left, but there are both positive and negative examples,
then there is a problem: examples have the same description but different
classifications, due to incorrect data (noise), not enough
information, or a nondeterministic domain.
Then take a majority vote.
DecTreeLearning ID3
function DecTreeL(ex, attr, default) ;;; returns a decision tree
;;; ex: set of examples, attr: set of attributes,
;;; default: default value for the goal predicate
if ex is empty then return default
elseif all ex have the same classification then return it
elseif attr is empty then return MajVal(ex)
else ChooseAttribute(attr, ex) -> best
     new decision tree with root test on best -> tree
     for each value v_i of best do
         {elements of ex with best = v_i} -> ex_i
         DecTreeL(ex_i, attr - {best}, MajVal(ex)) -> subtree
         add a branch to tree with label v_i and subtree subtree
     end
     return tree
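A minimal Python sketch of this pseudocode (all names such as dec_tree_learn are ours; ChooseAttribute is realised via information gain, which is introduced later in these notes):

```python
import math
from collections import Counter

def information(probs):
    """I(p_1,...,p_n) = -sum_i p_i * ld(p_i), with 0 * ld(0) = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def info_of(examples, goal):
    """Information contained in the goal classification of the examples."""
    counts = Counter(e[goal] for e in examples)
    total = sum(counts.values())
    return information([c / total for c in counts.values()])

def info_gain(attr, examples, goal):
    """Expected reduction in information from testing attribute attr."""
    remainder = 0.0
    for value in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == value]
        remainder += len(subset) / len(examples) * info_of(subset, goal)
    return info_of(examples, goal) - remainder

def majority_value(examples, goal):
    """MajVal: the most common classification among the examples."""
    return Counter(e[goal] for e in examples).most_common(1)[0][0]

def dec_tree_learn(examples, attrs, default, goal="Wait"):
    """Returns a classification label or an (attribute, branches) pair."""
    if not examples:
        return default
    classes = {e[goal] for e in examples}
    if len(classes) == 1:
        return classes.pop()                       # all same classification
    if not attrs:
        return majority_value(examples, goal)      # noise / missing attributes
    best = max(attrs, key=lambda a: info_gain(a, examples, goal))
    branches = {}
    for value in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == value]
        branches[value] = dec_tree_learn(
            subset, [a for a in attrs if a != best],
            majority_value(examples, goal), goal)
    return (best, branches)

# Tiny demo on a hypothetical miniature of the restaurant data:
examples = [
    {"Pat": "Some", "Hun": "Yes", "Wait": "Yes"},
    {"Pat": "Full", "Hun": "Yes", "Wait": "No"},
    {"Pat": "None", "Hun": "No",  "Wait": "No"},
    {"Pat": "Some", "Hun": "No",  "Wait": "Yes"},
]
print(dec_tree_learn(examples, ["Pat", "Hun"], "No"))
```

On this miniature data Pat has the larger information gain, so it becomes the root test.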
Generated Decision Tree
Discussion of the Result
Comparison of the original tree and the learned tree:
The trees differ.
The learned tree is smaller (no tests for Raining and Reservation, since all examples can be classified without them).
It detects regularities (waiting for Thai food on weekends).
It can make mistakes (e.g. the case where the wait estimate is <10 but the restaurant is full and we are not hungry).
Question: if the learned tree is consistent with the examples but differs from the original, how correct is it?
Assessing the Performance
1. Collect a large set of examples.
2. Divide it into two disjoint sets: a training set and a test set.
3. Run the learning algorithm on the training set and generate a hypothesis H.
4. Measure the percentage of examples in the test set that are correctly classified by H.
5. Repeat steps 1 to 4 for different sets.
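The steps above can be sketched as follows (a Python sketch; the majority-class learner is a hypothetical stand-in for a real decision-tree learner):

```python
import random

def holdout_accuracy(examples, learn, classify, train_fraction=0.7, seed=0):
    """Steps 1-4: shuffle, split into disjoint training and test sets,
    learn hypothesis H on the training set, and measure the fraction of
    test examples that H classifies correctly."""
    rng = random.Random(seed)
    data = examples[:]
    rng.shuffle(data)
    cut = int(len(data) * train_fraction)
    training, test = data[:cut], data[cut:]
    h = learn(training)
    return sum(1 for x, label in test if classify(h, x) == label) / len(test)

# Hypothetical stand-in learner: always predict the majority label.
def learn_majority(training):
    labels = [label for _, label in training]
    return max(set(labels), key=labels.count)

def classify_majority(h, x):
    return h

data = [(i, "yes" if i % 2 == 0 else "no") for i in range(20)]
print(holdout_accuracy(data, learn_majority, classify_majority))
```

Step 5 would repeat this for several random splits and average the results.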
Assessing the Performance (Cont'd)
The learning curve shows the increase in prediction quality as the training set grows.
Applications
ID3 was used to classify boards in a chess endgame. ID3 had to
recognise boards that lead to a loss within 3 moves: classification
of half a million positive situations from 1.4 million different
possible boards. A typical learning curve was the result.
Building up an expert system for designing gas-oil separation
systems for oil platforms: GASOIL, an expert system of BP with 2500 rules.
Building it by hand would have taken 10 person-years; using decision-tree
learning it took 100 person-days.
Learning to fly: a flight-simulator controller was generated by watching three
skilled human pilots. 90,000 examples over 20 state variables, each
labelled by the action taken. The extracted decision tree was
translated into C code. The program could fly better than its teachers.
Finding Best Attributes
In order to build small decision trees, select the best
attributes first (best = most informative).
Information is measured in bits.
One bit of information is enough to answer a yes/no question
about which one has no idea (e.g. the flip of a fair coin).
If the possible answers u_i have probabilities P(u_i), then
I(P(u_1),...,P(u_n)) = -SUM_{i=1}^{n} P(u_i) * ld(P(u_i))
e.g. a fair coin: I(1/2,1/2) = 1 (1 bit).
e.g. if we already know the outcome to 99%, learning the real outcome
carries an expected information of I(1/100,99/100) = 0.08 bits.
If we know the outcome with 100% certainty, there is no additional information: I = 0.
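A minimal sketch of this formula in Python (math.log2 plays the role of ld; terms with P(u_i) = 0 are skipped, anticipating the limit convention used in the calculations below):

```python
import math

def information(probs):
    """I(P(u_1),...,P(u_n)) = -sum_i P(u_i) * ld(P(u_i)).
    Terms with P(u_i) = 0 are skipped, using the convention
    lim_{x->0+} x * ld(x) = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(information([0.5, 0.5]))    # fair coin: 1 bit
print(information([0.01, 0.99]))  # nearly certain: about 0.08 bits
print(information([0.0, 1.0]))    # certain outcome: no information
```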
Logarithm
ld(x) (the dual logarithm, i.e. the logarithm to base 2) is defined for every positive real number x by
2^{ld(x)} = x
Logarithm (Cont'd)
Some values:

x     | 1 | 2 | 4 | 8 | 10   | 16 | 1/2 | 1/4 | 1/8
ld(x) | 0 | 1 | 2 | 3 | 3.32 | 4  | -1  | -2  | -3

lim_{x -> 0+} ld(x) = -infinity,   lim_{x -> 0+} x * ld(x) = 0

Remember:
log_{10} x = log_{10} 2 * ld(x)
Calculations in the Examples
With P(u_1) = P(u_2) = 1/2:

I(1/2,1/2) = -SUM_{i=1}^{2} P(u_i) * ld(P(u_i))
           = -1/2 * ld(1/2) - 1/2 * ld(1/2)
           = -1/2 * (-1) - 1/2 * (-1)
           = 1/2 + 1/2
           = 1
Calculations in the Examples (Cont'd)
With P(u_1) = 0/100 and P(u_2) = 100/100:

I(0/100,100/100) = -SUM_{i=1}^{2} P(u_i) * ld(P(u_i))
                 = -0/100 * ld(0/100) - 100/100 * ld(100/100)
               (*)= -0 * (-infinity) - 1 * 0
               (*)= 0 + 0
                 = 0

(*) Strictly speaking one has to use lim_{x -> 0+} x * ld(x) = 0 here.
Calculations in the Examples (Cont'd)
With P(u_1) = 1/100 and P(u_2) = 99/100:

I(1/100,99/100) = -SUM_{i=1}^{2} P(u_i) * ld(P(u_i))
                = -1/100 * ld(1/100) - 99/100 * ld(99/100)
                = -1/100 * (-6.64386) - 99/100 * (-0.0145)
                = 0.066439 + 0.014355
                = 0.080794

(ld(1/100) = -6.64386 and ld(99/100) = -0.0145)
Applied to Attributes
p := number of positive examples
n := number of negative examples

Information contained in a correct answer:

I(p/(p+n), n/(p+n)) = -p/(p+n) * ld(p/(p+n)) - n/(p+n) * ld(n/(p+n))

In the restaurant example p = n = 6, hence we need 1 bit of information. A
test of a single attribute A will usually not give all of it, but only
some. A divides the example set E into subsets
E_1,...,E_u. Each subset E_i has p_i positive and n_i
negative examples, so along that branch we need an additional
I(p_i/(p_i+n_i), n_i/(p_i+n_i)) bits of information,
weighted by (p_i+n_i)/(p+n), the probability that a random example
falls into E_i. Hence the information gain:

Gain(A) = I(p/(p+n), n/(p+n)) -
SUM_{i=1}^{u} (p_i+n_i)/(p+n) * I(p_i/(p_i+n_i), n_i/(p_i+n_i))
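The gain formula can be checked against the restaurant data (a Python sketch; the function gain and its representation of subsets as (p_i, n_i) count pairs are ours):

```python
import math

def information(probs):
    """I(p_1,...,p_n) = -sum_i p_i * ld(p_i), with 0 * ld(0) = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gain(split_counts, p, n):
    """Gain(A) for an attribute that splits p positive and n negative
    examples into subsets with counts (p_i, n_i)."""
    before = information([p / (p + n), n / (p + n)])
    remainder = sum(
        (pi + ni) / (p + n) * information([pi / (pi + ni), ni / (pi + ni)])
        for pi, ni in split_counts)
    return before - remainder

# Patrons splits the 12 restaurant examples into None (0+, 2-),
# Some (4+, 0-) and Full (2+, 4-):
print(gain([(0, 2), (4, 0), (2, 4)], 6, 6))  # about 0.541
# Type splits them into French (1+, 1-), Italian (1+, 1-),
# Thai (2+, 2-) and Burger (2+, 2-):
print(gain([(1, 1), (1, 1), (2, 2), (2, 2)], 6, 6))  # essentially 0
```

The two results match the entries for "patrons" and "type" in the table below.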
Heuristics
Choose the attribute with the largest information gain. In the restaurant example, initially:

alternative | bar | friday   | hungry   | patrons
0.0         | 0.0 | 0.020721 | 0.195709 | 0.540852

price    | rain | reservation | type | estimate
0.195709 | 0.0  | 0.020721    | 0.0  | 0.207519

Hence: choose "patrons".
Noise and Overfitting
Overfitting: the problem of finding meaningless "regularities" in
the data. (Example: rolls of dice characterised by attributes like
hour, day, and month result in a perfect decision tree whenever no
two examples have identical descriptions.)
One possibility: decision-tree pruning by detecting irrelevant
attributes, where irrelevant means no information gain for an infinitely large
sample. The null hypothesis assumes that there is no underlying
pattern; only if the deviation from it is statistically significant
(e.g. at the 5% level) is the attribute considered.
Alternative: cross-validation, i.e. use only part of the
data for learning and the rest for testing the prediction performance.
Repeat with different subsets and select the best tree. (This can be combined with pruning.)
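A sketch of cross-validation in Python (a k-fold variant; the majority-class learner is a hypothetical stand-in for the decision-tree learner):

```python
import random

def cross_validate(examples, learn, error, k=5, seed=0):
    """k-fold cross-validation: repeatedly learn on k-1 folds and
    measure the prediction error on the held-out fold, then average
    the error over all k folds."""
    rng = random.Random(seed)
    data = examples[:]
    rng.shuffle(data)
    folds = [data[i::k] for i in range(k)]
    errors = []
    for i in range(k):
        held_out = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        errors.append(error(learn(training), held_out))
    return sum(errors) / k

# Hypothetical stand-in learner: always predict the majority label.
def learn_majority(training):
    labels = [label for _, label in training]
    return max(set(labels), key=labels.count)

def error_rate(h, test_set):
    return sum(1 for _, label in test_set if label != h) / len(test_set)

data = [(i, "yes") for i in range(10)] + [(i, "no") for i in range(5)]
print(cross_validate(data, learn_majority, error_rate))
```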
Reinforcement Learning
Assume the following stochastic environment.
Each training sequence has the form:
(1,1) -> (1,2) -> (1,3) -> (2,3) -> (1,3) -> (2,3) -> (3,3) -> (4,3), reward +1
(1,1) -> (2,1) -> (1,1) -> (2,1) -> (3,1) -> (3,2) -> (4,2), reward -1
Rewards
The probability of a transition to a neighbouring state is equal
among all possibilities.
Assume the utility function is additive, i.e.
U([s_0,s_1,...,s_n]) = reward(s_0) + U([s_1,...,s_n])
The expected utility of a state is the expected reward-to-go of that state.
Utility to be Learned
The utilities can be learned by the Least Mean Squares approach, LMS for short
(also used in adaptive control theory). It assumes that the observed
reward-to-go on a sequence provides direct evidence of the actual
expected reward-to-go. At the end of each sequence: calculate the
reward-to-go for each state and update its utility estimate.
Passive Reinforcement Learning
vars U ;;; table of utility estimates
vars N ;;; table of frequencies for states
vars M ;;; table of transition probabilities from state to state
vars percepts ;;; percept sequence, initially empty

function PassiveRLAgent(e) ;;; returns an action
    add e to percepts
    increment N(State(e))
    LMSUpdate(U, e, percepts, M, N) -> U
    if Terminal?(e) then empty sequence -> percepts
    return the action Observe

function LMSUpdate(U, e, percepts, M, N) ;;; returns updated U
    if Terminal?(e) then
        0 -> reward-to-go
        for each e_i in percepts (starting at the end) do
            reward-to-go + Reward(e_i) -> reward-to-go
            RunningAverage(U(State(e_i)), reward-to-go, N(State(e_i)))
                -> U(State(e_i))
        end
    return U
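A Python sketch of the LMS update (the representation of a sequence as (state, reward) pairs and the reward of 0 in non-terminal states are our assumptions):

```python
def lms_update(sequence, U, N):
    """LMS update at the end of one training sequence.
    sequence is a list of (state, reward) pairs; U holds running-average
    utility estimates, N holds visit counts. Walking backwards through
    the sequence, the accumulated reward-to-go at each state is folded
    into that state's running average."""
    reward_to_go = 0.0
    for state, reward in reversed(sequence):
        reward_to_go += reward
        N[state] = N.get(state, 0) + 1
        # Running average: U_new = U_old + (x - U_old) / n
        U[state] = U.get(state, 0.0) + (reward_to_go - U.get(state, 0.0)) / N[state]
    return U

U, N = {}, {}
# First training sequence from above, assuming (our assumption) a reward
# of 0 in every non-terminal state and +1 in the terminal state:
seq1 = [((1, 1), 0), ((1, 2), 0), ((1, 3), 0), ((2, 3), 0),
        ((1, 3), 0), ((2, 3), 0), ((3, 3), 0), ((4, 3), 1)]
lms_update(seq1, U, N)
print(U[(1, 1)])  # 1.0: on this run every state has reward-to-go +1
```

States visited twice, such as (2,3), simply receive two samples of the same reward-to-go, so their running average is unchanged.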
Summary  Decision Tree Learning
Decision Tree Learning: a very efficient form of non-incremental
learning that searches the hypothesis space greedily.
It adds a subtree to the current tree and continues its search.
It does not backtrack.
It is highly dependent upon the criteria for selecting
properties to test.
It can be extended to allow more than two values as the result of the classification.
It can be extended to deal with noise.
Summary  Reinforcement Learning
 Reinforcement Learning: incremental learning approach.
 We could only give a glimpse of reinforcement learning.
 We looked only at the example of a passive agent, which
observes the world. Typically you will have an active agent, which
can make decisions based on its partial knowledge of the world.
 An active agent has to decide whether it should
exploit its current knowledge, or
explore the world.
Further Reading
S. Russell, P. Norvig. Artificial Intelligence - A
Modern Approach. 2nd Edition, Pearson Education,
2003. Sections 18.3 & 21.2.
G. Luger, W. Stubblefield. Artificial Intelligence -
Structures and Strategies for Complex Problem Solving. 2nd
Edition, The Benjamin/Cummings Publishing Company, 1993.
J.R. Quinlan. Induction of Decision Trees. Machine
Learning, 1(1):81-106, 1986.
J.R. Quinlan. The effect of noise on concept learning. In
Michalski et al., eds., Machine Learning: An Artificial
Intelligence Approach, Vol. 2. Morgan Kaufmann, 1986.
© Manfred Kerber, 2004, Introduction to AI
24.4.2005
The URL of this page is http://www.cs.bham.ac.uk/~mmk/Teaching/AI/Teaching/AI/l3.html.
URL of module http://www.cs.bham.ac.uk/~mmk/Teaching/AI/