### Introduction to AI - Worksheet 5

In this tutorial we are going to explore our final spam filtering technique: genetic algorithms. As usual you will find installation instructions on the module web page http://www.cs.bham.ac.uk/ mmk/Teaching/AI/index.html.

As discussed in the lectures, genetic algorithms can be used to search a space of hypotheses. In this worksheet we are going to go back to a question asked in the worksheet on decision trees:

 A decision tree is a conjunction of disjunctions of unary predicates. This representation is quite limiting. The expert system from last week allows a much more flexible representation - in fact any logical form is allowed ... How could a machine learning algorithm make use of the extra flexibility?

We will use a genetic algorithm to search the space of boolean formulas containing:

1. the usual connectives and, or, and not,
2. any rules you have created earlier, and
3. automatically generated rules that match a randomly chosen word.

An example rule is contains( "cash" ) or contains( "money" ) which matches any email that contains the words "money" or "cash".

Genetic algorithms are a form of randomised search, so they do not give the same result everytime they are run. There are also many parameters that can affect the performance of a genetic algorithm. The goal of this tutorial is to give you an understanding of how the parameters affect the performance of the genetic algorithm.

#### EXERCISE: 1

Your first task is to get the genetic algorithm spam filter up and running. To do so, change to the directory where you installed the software and run SISC:

```  > sisc
SISC (1.8.8) - main
#;>
```
Now load the file spam-filter.scm, which runs the genetic algorithm spam filter. You should see output similar to that below:

```Classifying Spam
==================
Classifying Spam
==================
67.0  Spam classification done   Classifying Genuine Email
==================
...
```
You should recognise this from last weeks - the basic framework hasn't changed. We're now ready to start exploring the genetic algorithm filter.

#### EXERCISE: 2

Your main task this week is to explore how four parameters: population size, mutation rate, crossover rate, and number of iterations, affect the performance of the genetic algorithm.

Below are four questions you should try to answer:

• What is the relationship between performance and population size? Does performance increase directly with population size, or is the relationship more complicated?
• What is the relationship between crossover rate and performance? What is the optimal crossover rate? Is crossover even necessary for good performance?
• What about the relationship between mutation rate and performance? Is mutation necessary for good performance?
• Finally, how are the number of iterations related to performance? Does running the genetic algorithm for more generations always result in better performance, or is there a cutoff where more generations don't help?
You should attempt to answer these questions systematically. Investigate first one parameter, holding all other parameters constant, then modify another parameter and see if that changes the effect of the first, and so on.

Specifically:

• Choose a set of values for each parameter. Values between 1-1000 are good for population size and number of generations. Mutation rate and crossover rate should be between 0.0 and 1.0, and always sum to 1.0 or less. You should choose at least 3 values for each parameter.
• For each combination of values run at least one experiment. As the genetic algorithm is a random search you may want to run it multiple times with each combination of parameters to gain more accurate results.
• Record the performance of the genetic algorithm for each combination of parameters.
To save time consider sharing your results with other people.

To run the genetic algorithm, within SISC type the command:

```  #;> (start-ga number-of-generations
population-size
mutation-rate
crossover-rate
trace?)
```
All parameters are numbers, except for trace? which is #t or #f to turn tracing on or off. With tracing on you can see how the genetic algorithm evolves from one generation to the next. This will enable you to see if it reaches optimal performance before it has finished running.

#### EXERCISE: 3

There are many other modifications you can make to the basic genetic algorithm. If you have time implement a few of the suggestions below.

• The fitness function can have a big effect on the performance of the genetic algorithm. Change the fitness function in Scorer.java to use the sigmoid function covered in your Neural Network notes.
• Elitism is a modification to the basic selection algorithm that ensures the best performing individual always progresses to the next generation. Modify Selector.java to implement elitism.

© Noel Welsh & Manfred Kerber, 2004, Introduction to AI
22.11.2004