Read this worksheet and install the software before your tutorial.
In this tutorial we are going to explore our third spam filtering technique: naive bayes classifiers. As usual you will find installation instructions on the course web page http://www.cs.bham.ac.uk/ mmk/Teaching/AI/index.html.
A note on notation: In the `bag of words' representation we represent each email as a vector of features. Each feature corresponds to a word in the corpus, and is 1 if that word is present in the email and 0 otherwise.
We will use P(featurei = 1 | class = c) or just P(featurei | class = c) to represent the probability of a feature i occurring in an email given that the email belongs to a class c, where c may be genuine or spam. P(spam) will denote denote either the prior probability that an email is spam (before looking at its content), or the posterior probability that an email is spam after looking at its content. The later is more correctly written P(class = spam | emaili) or P(class = spam | [feature1, feature2, ... , featuren]). Similarly for P(genuine). The usage will be clear from context.
You may have noticed two problems with the decision tree techniques we used last week:
We are going to address these two problems. The first problem we will address by using the `bag of words' representation discussed in last week's tutorial. The second problem we can address by taking advantage of the probabilities the naive bayes algorithm generates.
The naive bayes algorithm will generate two posterior probabilities, P(spam) and P(genuine). If the cost of misclassifying spam as genuine email was the same as the cost of misclassifying genuine email as spam we should classify email as spam if P(spam) > P(genuine). However we've said that the cost of misclassifying a genuine email is much greater than the cost of misclassifying a spam email. We can factor in these costs. For each prediction we make we can write out the cost if the prediction is correct or wrong:
| Predict spam | Predict genuine | |
| -------- | ------------- | ------------- |
| Is spam | 1 | 10 |
| -------- | ------------- | ------------- |
| Is genuine | 100 | 1 |
| -------- | ------------- | ------------- |
In the above example the cost of classifying spam as genuine is 10, the cost of classifying genuine as spam is 100 and the cost of making the correct prediction is 1.
Now the posterior probabilities are our best guess of the likelihood of each case (is spam or is genuine) being true. Let us say we predict P(spam) = 0.99 and P(genuine) = 0.01. We can multiply the costs by the probability of the true case (is spam or is genuine) to get the expected cost of each action:
| Predict spam | Predict genuine | |
| -------- | ------------- | ------------- |
| Is spam | 0.99 | 9.9 |
| -------- | ------------- | ------------- |
| Is genuine | 1.0 | 0.01 |
| -------- | ------------- | ------------- |
Since we don't know if the email really is spam or genuine we must sum up the cost of each prediction and then choose the prediction with the lowest expected cost.
| Predict spam | Predict genuine | |
| -------- | ------------- | ------------- |
| Cost | 1.99 | 9.91 |
| -------- | ------------- | ------------- |
Hence in the above example we would choose to classify the email as spam.
> sisc SISC (1.8.8) - main #;>Now load the file spam-filter.scm, which runs the naive bayes spam filter. You should see output similar to that below:
#;> (load "spam-filter.scm") Classifying Spam ================== Spam 1: Classified by (naive-bayes) Spam 2: Classified by (naive-bayes) Spam 3: Not classified (naive-bayes) Spam 4: Classified by (naive-bayes) ... spam classification done Classifying email ================== ...You should recognise this from last week - the basic framework hasn't changed. We're now ready to start extending the naive bayes spam filter.
The current implementation calculates the prior probabilities P(spam) and P(genuine) and classifies emails based on the naive decision rule that does not take cost into account.
You should start by adding code to calculate the class conditional probabilities P(featurei | class = c). Let ni,spam be the number of occurrences of word i in spam, and nspam be the total number of spam emails. You can calculate P(featurei | class = spam) using the equation:
Similarly for P(featurei | class = genuine)
When you have done this add code to calculate the posterior probabilities P(spam) and P(genuine) using the equations given in your notes. Now compile your code and reload spam-filter.scm to see if your changes have improved the naive bayes filter. How many spams are misclassified? How many genuine emails?