Introduction to AI - Worksheet 3
In this tutorial we are going to explore our second spam
filtering technique: decision tree classifiers. There are
several algorithms that automatically generate decision
trees, of which ID3 and C4.5 are the most well-known. We
are going to compare a hand-generated decision tree against
the ID3 algorithm.
This tutorial builds on the work done in the previous
tutorial. The rules you created last week will form the set
of predicates from which nodes of the decision tree will be
created. If you didn't get very far last week some rules
are provided for you. You will probably do better if you
create your own rules.
The installation instructions and code are in the same place
as last week.
There are many online resources for decision tree-based spam
filtering, as a search will show. The commercial
applications show the utility of this approach, but the more
useful for our purpose are research papers, as they actually
give details of the implementation. A good place to start
in the literature is
http://citeseer.ist.psu.edu/478863.html. By
following the bibliographic links you will find much of the
existing literature on spam filtering.
EXERCISE: 1
Your first task is to get the decision tree spam filter up
and running. To do so, change to the directory where you
installed the software and run SISC:
> sisc
SISC (1.8.8) - main
#;>
Now load the file spam-filter.scm, which runs the
decision tree spam filter. You should see output similar
to that below:
#;> (load "spam-filter.scm")
Classifying Spam
==================
Spam 1: Classified by (Example Tree)
Spam 2: Classified by (Example Tree)
Spam 3: Not classified (spam #f)
Spam 4: Classified by (Example Tree)
...
spam classification done
Classifying email
==================
...
You should recognise this from last week - the basic
framework hasn't changed. We're now ready to start
extending the decision tree spam filter.
EXERCISE: 2
Your next task is to create a new decision tree. Take a
look at the example decision tree
ExampleTree.java (or example-tree.scm in
Scheme). Notice that the class extends
AbstractTree. Your class must do the same. Now
look at the value assigned to head - this is the
actual tree of nodes that defines the decision tree. The
actual decision is made by recursively calling the
Nodes that make up the decision tree till a
boolean value is returned. If true is
returned the email is classified as spam; otherwise it is
classified as genuine email. A BinaryNode is
constructed from 3 arguments: a Rule that forms
the predicate of the node, a Node that is called
if the predicate returns true, and a Node that is
called if the predicate returns false. A
TrueNode returns true in all situations.
A FalseNode returns false in all
situations.
Now write your own decision tree using the rules we have
supplied and the rules you have created on your own. As
before you simply need to compile the file and re-load
spam-filter.scm for the decision tree to be
loaded.
EXERCISE: 3
Compare your decision tree to that created by ID3. To
do this un-comment the line (load "id3.scm") in
spam-filter.scm and re-load
spam-filter.scm.
How does your decision tree compare to ID3? Can you
improve your decision tree to equal or better the
performance of ID3? Can you create a smaller tree than
the one created by ID3 that performs just as well?
EXERCISE: 4
The way that data is represented is critical to the
performance of any machine learning algorithm. We have a
very flexible representation, as our rules can match just
about anything. However it requires substantial human
effort to create the rules. The more common
representation is the "bag of words" representation. In
this representation each email is represented by a vector
of booleans. The vector is as long as the number of words
in the corpus - 30,000 words is not unusual. An element
is true if the word corresponding to that word is present
in the email, and false otherwise. Implement the ID3
algorithm using this representation. How does this
compare to the decision trees using our hand crafted
rules?
EXERCISE: 5
A decision tree is a conjunction of disjunctions of unary
predicates. This representation is quite limiting. The expert
system from last week allows a much more flexible representation -
in fact any logical form is allowed. Can you use this extra
flexibility to out-perform the decision tree? How could a machine
learning algorithm make use of the extra flexibility?
© Noel Welsh & Manfred Kerber, 2004, Introduction to AI
26.11.2004
The URL of this page is ftp://ftp.cs.bham.ac.uk/pub/authors/M.Kerber/Teaching/AI/e3.html.
URL of module http://www.cs.bham.ac.uk/~mmk/Teaching/AI/