Introduction to AI - Worksheet 5
Read this worksheet and install the software before
In this tutorial we are going to explore our final
spam filtering technique: genetic algorithms. As usual
you will find installation instructions on the module web
As discussed in the lectures, genetic algorithms can be used
to search a space of hypotheses. In this worksheet we are
going to go back to a question asked in the worksheet on
A decision tree is a conjunction of disjunctions of unary predicates. This
representation is quite limiting. The expert system from
last week allows a much more flexible representation - in
fact any logical form is allowed ... How could a
machine learning algorithm make use of the extra
We will use a genetic algorithm to search the space of
boolean formulas containing:
- the usual connectives and, or, and
- any rules you have created earlier, and
- automatically generated rules that match a randomly chosen word.
An example rule is contains( "cash" ) or
contains( "money" ) which matches any email
that contains the words "money" or "cash".
Genetic algorithms are a form of randomised search, so they
do not give the same result everytime they are run. There
are also many parameters that can affect the performance of
a genetic algorithm. The goal of this tutorial is to give
you an understanding of how the parameters affect the
performance of the genetic algorithm.
Your first task is to get the genetic algorithm spam
filter up and running. To do so, change to the directory
where you installed the software and run SISC:
SISC (1.8.8) - main
Now load the file spam-filter.scm,
which runs the genetic algorithm spam filter. You should
see output similar to that below:
#;> (load "spam-filter.scm")
67.0 Spam classification done
You should recognise this from last weeks - the basic
framework hasn't changed. We're now ready to start
exploring the genetic algorithm filter.
Classifying Genuine Email
Your main task this week is to explore how four
parameters: population size, mutation rate, crossover
rate, and number of iterations, affect the performance of
the genetic algorithm.
Below are four questions you should try to answer:
You should attempt to answer these questions
systematically. Investigate first one parameter, holding
all other parameters constant, then modify another
parameter and see if that changes the effect of the first,
and so on.
- What is the relationship between performance and
population size? Does performance increase directly
with population size, or is the relationship more
- What is the relationship between crossover rate and
performance? What is the optimal crossover rate? Is
crossover even necessary for good performance?
- What about the relationship between mutation rate
and performance? Is mutation necessary for good
- Finally, how are the number of iterations related to
performance? Does running the genetic algorithm for
more generations always result in better performance, or
is there a cutoff where more generations don't help?
To save time consider sharing your results with other
- Choose a set of values for each parameter. Values
between 1-1000 are good for population size and number
of generations. Mutation rate and crossover rate should
be between 0.0 and 1.0, and always sum to 1.0 or less.
You should choose at least 3 values for each parameter.
- For each combination of values run at least one
experiment. As the genetic algorithm is a random search
you may want to run it multiple times with each
combination of parameters to gain more accurate results.
- Record the performance of the genetic algorithm for
each combination of parameters.
To run the genetic algorithm, within SISC type the command:
#;> (start-ga number-of-generations
All parameters are numbers, except for trace?
which is #t or #f to turn tracing on
or off. With tracing on you can see how the genetic
algorithm evolves from one generation to the next. This
will enable you to see if it reaches optimal performance
before it has finished running.
There are many other modifications you can make to the
basic genetic algorithm. If you have time implement a few
of the suggestions below.
- The fitness function can have a big effect on the
performance of the genetic algorithm. Change the
fitness function in Scorer.java to use the
sigmoid function covered in your Neural Network notes.
- Elitism is a modification to the basic selection
algorithm that ensures the best performing individual
always progresses to the next generation. Modify
Selector.java to implement elitism.
© Noel Welsh & Manfred Kerber, 2004, Introduction to AI
The URL of this page is ftp://ftp.cs.bham.ac.uk/pub/authors/M.Kerber/Teaching/AI/e5.html.
URL of module http://www.cs.bham.ac.uk/~mmk/Teaching/AI/