Introduction to AI - Worksheet 5

Read this worksheet and install the software before your tutorial.

In this tutorial we are going to explore our final spam filtering technique: genetic algorithms. As usual you will find installation instructions on the module web page http://www.cs.bham.ac.uk/ mmk/Teaching/AI/index.html.

As discussed in the lectures, genetic algorithms can be used to search a space of hypotheses. In this worksheet we are going to go back to a question asked in the worksheet on decision trees:

A decision tree is a conjunction of disjunctions of unary predicates. This representation is quite limiting. The expert system from last week allows a much more flexible representation - in fact any logical form is allowed ... How could a machine learning algorithm make use of the extra flexibility?

We will use a genetic algorithm to search the space of boolean formulas containing:

  1. the usual connectives and, or, and not,
  2. any rules you have created earlier, and
  3. automatically generated rules that match a randomly chosen word.

An example rule is contains( "cash" ) or contains( "money" ) which matches any email that contains the words "money" or "cash".

Genetic algorithms are a form of randomised search, so they do not give the same result everytime they are run. There are also many parameters that can affect the performance of a genetic algorithm. The goal of this tutorial is to give you an understanding of how the parameters affect the performance of the genetic algorithm.

EXERCISE: 1

Your first task is to get the genetic algorithm spam filter up and running. To do so, change to the directory where you installed the software and run SISC:

  > sisc
  SISC (1.8.8) - main
  #;> 
Now load the file spam-filter.scm, which runs the genetic algorithm spam filter. You should see output similar to that below:

Classifying Spam
==================
  #;> (load "spam-filter.scm")
  Classifying Spam
  ==================
  67.0  Spam classification done

Classifying Genuine Email ================== ...

You should recognise this from last weeks - the basic framework hasn't changed. We're now ready to start exploring the genetic algorithm filter.

EXERCISE: 2

Your main task this week is to explore how four parameters: population size, mutation rate, crossover rate, and number of iterations, affect the performance of the genetic algorithm.

Below are four questions you should try to answer:

You should attempt to answer these questions systematically. Investigate first one parameter, holding all other parameters constant, then modify another parameter and see if that changes the effect of the first, and so on.

Specifically:

To save time consider sharing your results with other people.

To run the genetic algorithm, within SISC type the command:

  #;> (start-ga number-of-generations
                population-size
                mutation-rate
                crossover-rate
                trace?)
All parameters are numbers, except for trace? which is #t or #f to turn tracing on or off. With tracing on you can see how the genetic algorithm evolves from one generation to the next. This will enable you to see if it reaches optimal performance before it has finished running.

EXERCISE: 3

There are many other modifications you can make to the basic genetic algorithm. If you have time implement a few of the suggestions below.


© Noel Welsh & Manfred Kerber, 2004, Introduction to AI
22.11.2004
The URL of this page is ftp://ftp.cs.bham.ac.uk/pub/authors/M.Kerber/Teaching/AI/e5.html.
URL of module http://www.cs.bham.ac.uk/~mmk/Teaching/AI/