### Introduction to AI - Worksheet 4

In this tutorial we are going to explore our third spam filtering technique: naive Bayes classifiers. As usual you will find installation instructions on the course web page http://www.cs.bham.ac.uk/~mmk/Teaching/AI/index.html.

A note on notation: In the `bag of words' representation we represent each email as a vector of features. Each feature corresponds to a word in the corpus, and is 1 if that word is present in the email and 0 otherwise.

email_i = [feature_1, feature_2, ..., feature_n]
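As a concrete illustration, here is a minimal sketch (in Java, the language of the skeleton code used later in this worksheet) of how such a feature vector might be built. The class name, vocabulary, and email text are invented for the example; the real corpus code may differ:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BagOfWords {
    // Build the binary feature vector for one email, given a fixed
    // vocabulary (one feature per word in the corpus).
    static int[] features(List<String> vocabulary, String email) {
        Set<String> words = new HashSet<>(Arrays.asList(email.toLowerCase().split("\\W+")));
        int[] v = new int[vocabulary.size()];
        for (int i = 0; i < vocabulary.size(); i++) {
            v[i] = words.contains(vocabulary.get(i)) ? 1 : 0; // 1 if word present
        }
        return v;
    }

    public static void main(String[] args) {
        int[] v = features(List.of("free", "money", "meeting"), "Free money now!");
        System.out.println(Arrays.toString(v));
    }
}
```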

We will use P(feature_i = 1 | class = c), or just P(feature_i | class = c), to represent the probability of feature i occurring in an email given that the email belongs to class c, where c may be genuine or spam. P(spam) will denote either the prior probability that an email is spam (before looking at its content) or the posterior probability that an email is spam after looking at its content. The latter is more correctly written P(class = spam | email_i) or P(class = spam | [feature_1, feature_2, ..., feature_n]). Similarly for P(genuine). The usage will be clear from context.

You may have noticed two problems with the decision tree techniques we used last week:

• The decision trees are limited by the available rules. If no rule classifies an email, the decision tree cannot classify it.
• Classifying a genuine email as spam is much worse than classifying a spam as genuine; however, the decision tree algorithm does not take account of this.

We are going to address these two problems. The first problem we will address by using the `bag of words' representation discussed in last week's tutorial. The second problem we can address by taking advantage of the probabilities the naive Bayes algorithm generates.

The naive Bayes algorithm will generate two posterior probabilities, P(spam) and P(genuine). If the cost of misclassifying spam as genuine email were the same as the cost of misclassifying genuine email as spam, we should classify an email as spam if P(spam) > P(genuine). However, we've said that the cost of misclassifying a genuine email is much greater than the cost of misclassifying a spam email. We can factor in these costs. For each prediction we make, we can write out the cost if the prediction is correct or wrong:

|            | Predict spam | Predict genuine |
|------------|--------------|-----------------|
| Is spam    | 1            | 10              |
| Is genuine | 100          | 1               |

In the above example the cost of classifying spam as genuine is 10, the cost of classifying genuine as spam is 100 and the cost of making the correct prediction is 1.

Now the posterior probabilities are our best guess of the likelihood of each case (is spam or is genuine) being true. Let us say we predict P(spam) = 0.99 and P(genuine) = 0.01. We can multiply the costs by the probability of the true case (is spam or is genuine) to get the expected cost of each action:

|            | Predict spam | Predict genuine |
|------------|--------------|-----------------|
| Is spam    | 0.99         | 9.9             |
| Is genuine | 1.0          | 0.01            |

Since we don't know if the email really is spam or genuine we must sum up the cost of each prediction and then choose the prediction with the lowest expected cost.

|      | Predict spam | Predict genuine |
|------|--------------|-----------------|
| Cost | 1.99         | 9.91            |

Hence in the above example we would choose to classify the email as spam.
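The calculation above can be sketched in a few lines of Java. The class and method names here are purely illustrative, not part of the provided skeleton:

```java
public class ExpectedCost {
    // Expected cost of a prediction: each possible true state's cost,
    // weighted by the probability of that state.
    static double expectedCost(double pSpam, double costIfSpam, double costIfGenuine) {
        return pSpam * costIfSpam + (1 - pSpam) * costIfGenuine;
    }

    public static void main(String[] args) {
        double pSpam = 0.99; // so P(genuine) = 0.01
        // The two columns of the cost table above:
        double predictSpam    = expectedCost(pSpam, 1, 100); // correct = 1, genuine-as-spam = 100
        double predictGenuine = expectedCost(pSpam, 10, 1);  // spam-as-genuine = 10, correct = 1
        System.out.println(predictSpam < predictGenuine ? "spam" : "genuine"); // prints "spam"
    }
}
```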

#### EXERCISE: 1

Before you begin work on the naive Bayes classifier, run your work from last week. You should see that the corpus has increased greatly in size, and your rules no longer classify spam as well as before. This illustrates the first problem we raised: the limitation of hand-crafted rules.

#### EXERCISE: 2

Your first task is to get the naive Bayes spam filter up and running. To do so, change to the directory where you installed the software and run SISC:

```
> sisc
SISC (1.8.8) - main
#;>
```
Now load the file spam-filter.scm, which runs the naive Bayes spam filter. You should see output similar to that below:

```
#;> (load "spam-filter.scm")
Classifying Spam
==================
Spam 1: Classified by (naive-bayes)
Spam 2: Classified by (naive-bayes)
Spam 3: Not classified (naive-bayes)
Spam 4: Classified by (naive-bayes)
...
spam classification done

Classifying email
==================
...
```
You should recognise this from last week - the basic framework hasn't changed. We're now ready to start extending the naive Bayes spam filter.

#### EXERCISE: 3

Your main task for this week is to implement the naive Bayes classifier. A skeleton implementation is provided in NaiveBayesClassifer.java. You must complete the implementation.

The current implementation calculates the prior probabilities P(spam) and P(genuine) and classifies emails based on the naive decision rule that does not take cost into account.

You should start by adding code to calculate the class conditional probabilities P(feature_i | class = c). Let n_{i,spam} be the number of occurrences of word i in spam, and n_spam be the total number of spam emails. You can calculate P(feature_i | class = spam) using the equation:

P(feature_i | class = spam) = n_{i,spam} / n_spam

Similarly for P(feature_i | class = genuine).
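In code, this estimate is a single division. The sketch below is only an illustration of the equation, not the code from NaiveBayesClassifer.java, and the counts in `main` are invented:

```java
public class ClassConditional {
    // P(feature_i | class = c) = n_{i,c} / n_c
    static double conditional(int wordCount, int classEmailCount) {
        return (double) wordCount / classEmailCount; // cast avoids integer division
    }

    public static void main(String[] args) {
        // e.g. a word appearing in 30 of 40 spam emails
        System.out.println(conditional(30, 40)); // prints 0.75
    }
}
```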

When you have done this, add code to calculate the posterior probabilities P(spam) and P(genuine) using the equations given in your notes. Now compile your code and reload spam-filter.scm to see if your changes have improved the naive Bayes filter. How many spams are misclassified? How many genuine emails?
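If your notes use the standard naive Bayes rule, the posterior computation looks roughly like the following sketch. The method names and the handling of absent features (multiplying by 1 - P(feature_i | class)) are assumptions here; defer to the equations in your notes:

```java
public class Posterior {
    // P(spam | email) is proportional to P(spam) * product over i of
    // P(feature_i = x_i | spam), and similarly for genuine; normalising
    // the two class scores gives posteriors that sum to 1.
    static double posteriorSpam(double priorSpam, double[] pFeatSpam,
                                double[] pFeatGenuine, int[] email) {
        double spam = priorSpam, genuine = 1 - priorSpam;
        for (int i = 0; i < email.length; i++) {
            spam    *= (email[i] == 1) ? pFeatSpam[i]    : 1 - pFeatSpam[i];
            genuine *= (email[i] == 1) ? pFeatGenuine[i] : 1 - pFeatGenuine[i];
        }
        return spam / (spam + genuine);
    }
}
```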

#### EXERCISE: 4

Add a cost-sensitive decision rule. Assume that misclassifying a genuine email is a thousand times worse than misclassifying a spam email.
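If we simplify by giving correct predictions zero cost (an assumption; you may instead keep a cost of 1 as in the earlier worked example), the rule reduces to comparing the two expected costs. A sketch:

```java
public class CostSensitiveRule {
    static final double GENUINE_AS_SPAM_COST = 1000; // a thousand times worse
    static final double SPAM_AS_GENUINE_COST = 1;

    // Classify as spam only when the expected cost of predicting spam
    // is lower than the expected cost of predicting genuine.
    static boolean classifyAsSpam(double pSpam) {
        double costPredictSpam    = (1 - pSpam) * GENUINE_AS_SPAM_COST;
        double costPredictGenuine = pSpam * SPAM_AS_GENUINE_COST;
        return costPredictSpam < costPredictGenuine;
    }
}
```

Note how demanding this rule is: even an email with P(spam) = 0.999 is still classified as genuine, because the small chance of it being genuine carries a very large cost.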

#### EXERCISE: 5

There are many enhancements that can be made to the basic naive Bayes algorithm. The most important modifications avoid inaccurate estimates of class conditional probabilities when the number of observed occurrences is low. Consider the case when a word occurs in every spam email and never in a genuine email. In the above equation the class conditional probabilities are P(feature_i | class = spam) = 1.0 and P(feature_i | class = genuine) = 0.0. The evidence doesn't justify these strong judgements. We will add in a `fudge factor' and pretend we have seen every word a small number of times. Let f be the `fudge factor'. Then we can adjust the calculation of P(feature_i | class = spam) using the equation:

P(feature_i | class = spam) = (n_{i,spam} + 0.5f) / (n_spam + f)

Make this modification to the naive Bayes classifier, setting f to 5.
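The adjusted estimate is a one-line change to the earlier calculation. As before, this is only a sketch with illustrative names, not the skeleton's code:

```java
public class SmoothedEstimate {
    // P(feature_i | class = c) = (n_{i,c} + 0.5 f) / (n_c + f)
    static double conditional(int wordCount, int classEmailCount, double f) {
        return (wordCount + 0.5 * f) / (classEmailCount + f);
    }

    public static void main(String[] args) {
        // An unseen word no longer gets probability zero:
        System.out.println(conditional(0, 10, 5)); // 2.5 / 15, not 0
        // And a word seen once in a single email no longer gets 1.0:
        System.out.println(conditional(1, 1, 5)); // 3.5 / 6, not 1
    }
}
```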

#### EXERCISE: 6

Read some of the existing research on the naive Bayes classifier, and enhance your classifier using the ideas you find therein. Suggested papers include:

© Noel Welsh & Manfred Kerber, 2004, Introduction to AI
22.11.2004