**Read this worksheet and install the software before
your tutorial.**

In this tutorial we are going to explore our third spam filtering technique: naive Bayes classifiers. As usual, you will find installation instructions on the course web page http://www.cs.bham.ac.uk/~mmk/Teaching/AI/index.html.

A note on notation: in the 'bag of words' representation we represent each email as a vector of features. Each feature corresponds to a word in the corpus, and is 1 if that word is present in the email and 0 otherwise.
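The representation above can be sketched in a few lines. This is an illustrative Python sketch (the tutorial code itself is in Scheme), with a hypothetical four-word vocabulary standing in for the corpus:

```python
# Hypothetical vocabulary standing in for the words of the corpus.
vocabulary = ["cheap", "meeting", "viagra", "tomorrow"]

def bag_of_words(email_text):
    """Return the binary 'bag of words' feature vector for one email."""
    words = set(email_text.lower().split())
    # Feature i is 1 if vocabulary word i appears in the email, 0 otherwise.
    return [1 if word in words else 0 for word in vocabulary]

print(bag_of_words("Cheap meds, cheap prices"))  # -> [1, 0, 0, 0]
```

Note that the vector records only presence or absence of each word, not how often it occurs.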

We will use P(*feature*_{i} = 1 | class = c), or just
P(*feature*_{i} | class = c), to represent the probability of
feature i occurring in an email given that the email
belongs to a class c, where c may be genuine or
spam. P(spam) will denote either the prior
probability that an email is spam (before looking at its
content) or the posterior probability that an email is spam
after looking at its content. The latter is more correctly
written P(class = spam | email_{i}) or P(class = spam |
[*feature*_{1}, *feature*_{2}, ... , *feature*_{n}]). Similarly for
P(genuine). The usage will be clear from context.

You may have noticed two problems with the decision tree techniques we used last week:

- The decision trees are limited by the available rules. If no rule classifies an email, the decision tree cannot classify it.
- Classifying a genuine email as spam is much worse than classifying a spam email as genuine; however, the decision tree algorithm does not take this into account.

We are going to address these two problems. The first we will address by using the 'bag of words' representation discussed in last week's tutorial. The second we can address by taking advantage of the probabilities the naive Bayes algorithm generates.

The naive Bayes algorithm will generate two posterior probabilities, P(spam) and P(genuine). If the cost of misclassifying spam as genuine email were the same as the cost of misclassifying genuine email as spam, we should classify an email as spam whenever P(spam) > P(genuine). However, we've said that the cost of misclassifying a genuine email is much greater than the cost of misclassifying a spam email. We can factor in these costs. For each prediction we make, we can write out the cost if the prediction is correct or wrong:

|            | Predict spam | Predict genuine |
| ---------- | ------------ | --------------- |
| Is spam    | 1            | 10              |
| Is genuine | 100          | 1               |

In the above example the cost of classifying spam as genuine is 10, the cost of classifying genuine as spam is 100 and the cost of making the correct prediction is 1.

Now the posterior probabilities are our best guess of the likelihood of each case (is spam or is genuine) being true. Let us say we predict P(spam) = 0.99 and P(genuine) = 0.01. We can multiply the costs by the probability of the true case (is spam or is genuine) to get the expected cost of each action:

|            | Predict spam | Predict genuine |
| ---------- | ------------ | --------------- |
| Is spam    | 0.99         | 9.9             |
| Is genuine | 1.0          | 0.01            |

Since we don't know if the email really is spam or genuine we must sum up the cost of each prediction and then choose the prediction with the lowest expected cost.

|      | Predict spam | Predict genuine |
| ---- | ------------ | --------------- |
| Cost | 1.99         | 9.91            |

Hence in the above example we would choose to classify the email as spam.
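The whole expected-cost decision rule above can be sketched as follows. This is an illustrative Python sketch rather than the tutorial's Scheme code; the cost values are the ones from the example tables:

```python
# cost[true_class][prediction]: correct prediction costs 1,
# spam predicted genuine costs 10, genuine predicted spam costs 100.
cost = {
    "spam":    {"spam": 1,   "genuine": 10},
    "genuine": {"spam": 100, "genuine": 1},
}

def classify(p_spam, p_genuine):
    """Pick the prediction with the lowest expected cost."""
    expected = {}
    for prediction in ("spam", "genuine"):
        # Sum over the possible true classes, weighted by their probability.
        expected[prediction] = (p_spam * cost["spam"][prediction]
                                + p_genuine * cost["genuine"][prediction])
    return min(expected, key=expected.get), expected

label, expected = classify(0.99, 0.01)
# expected == {"spam": 1.99, "genuine": 9.91}, so label == "spam"
```

Notice that even with P(spam) = 0.99, a high enough cost for misclassifying genuine email could still flip the decision; that is exactly the asymmetry the plain P(spam) > P(genuine) rule ignores.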

Start SISC:

```
> sisc
SISC (1.8.8) - main
#;>
```

Now load the file:

```
#;> (load "spam-filter.scm")
Classifying Spam
==================
Spam 1: Classified by (naive-bayes)
Spam 2: Classified by (naive-bayes)
Spam 3: Not classified (naive-bayes)
Spam 4: Classified by (naive-bayes)
...
spam classification done
Classifying email
==================
...
```

You should recognise this from last week - the basic framework hasn't changed. We're now ready to start extending the naive Bayes spam filter.

The current implementation calculates the prior probabilities P(spam) and P(genuine) and classifies emails based on the naive decision rule that does not take cost into account.

You should start by adding code to calculate the class
conditional probabilities P(*feature*_{i} | class = c). Let
n_{i,spam} be the number of spam emails containing word i,
and n_{spam} be the total number of spam emails. You can calculate
P(*feature*_{i} | class = spam) using the equation:

P(*feature*_{i} | class = spam) = n_{i,spam} / n_{spam}

Similarly for P(*feature*_{i} | class = genuine)

When you have done this, add code to calculate the
posterior probabilities P(spam) and P(genuine) using
the equations given in your notes.
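As a rough guide to what these calculations look like, here is an illustrative Python sketch (your actual code goes in the Scheme file `spam-filter.scm`; the counts below are invented, and the posterior uses Bayes' rule with the naive independence assumption):

```python
# Invented example counts: n_spam / n_genuine are the totals per class,
# counts[c][i] is the number of class-c emails containing word i.
n_spam, n_genuine = 4, 6
counts = {"spam": [3, 1], "genuine": [1, 5]}   # two hypothetical features

def p_feature_given_class(i, c):
    """Class-conditional probability P(feature_i = 1 | class = c)."""
    n_c = n_spam if c == "spam" else n_genuine
    return counts[c][i] / n_c

def posterior_spam(features):
    """P(class = spam | features): prior times the product of the
    per-feature likelihoods, normalised over both classes."""
    prior = {"spam": n_spam / (n_spam + n_genuine),
             "genuine": n_genuine / (n_spam + n_genuine)}
    joint = {}
    for c in ("spam", "genuine"):
        joint[c] = prior[c]
        for i, present in enumerate(features):
            p = p_feature_given_class(i, c)
            joint[c] *= p if present else (1 - p)
    return joint["spam"] / (joint["spam"] + joint["genuine"])
```

With these counts, an email containing only word 0 (a word common in spam) gets a posterior P(spam | email) above 0.9, while one containing only word 1 gets a posterior below 0.1.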
Now compile your code and reload
`spam-filter.scm` to see if your changes
have improved the naive Bayes filter. How many spams are
misclassified? How many genuine emails?

- The writings of Paul Graham, particularly http://www.paulgraham.com/spam.html and http://www.paulgraham.com/better.html.
- An Evaluation of Naive Bayesian Anti-Spam Filtering, http://xxx.lanl.gov/abs/cs.CL/0006013.
- Improving Simple Bayes, http://robotics.stanford.edu/%7Eronnyk/impSBC.ps.Z.
- Boosting and Naive Bayesian Learning, http://www-cse.ucsd.edu/users/elkan/papers/bnb.ps.

© Noel Welsh & Manfred Kerber, 2004, Introduction to AI

22.11.2004

The URL of this page is ftp://ftp.cs.bham.ac.uk/pub/authors/M.Kerber/Teaching/AI/e4.html.

URL of module http://www.cs.bham.ac.uk/~mmk/Teaching/AI/