Late submissions will be penalised 5 marks for each 24 hours, or part thereof, by which they are late. Submissions received after 12:00 noon on Friday, 17 December 2004 will not be accepted and will receive a mark of zero. Your code and a PDF or PostScript version of your report must be submitted using the ai-intro-submit command. This command works in exactly the same way as the fyw-submit command used in the first-year Java course. Ensure you keep the email confirming submission.
The aims of this assignment are: to demonstrate your understanding of formalisms and algorithms used in artificial intelligence; to apply them to the example problem of email spam filtering; and to demonstrate your ability to conduct a scientific study into an aspect of artificial intelligence.
To complete the assignment you must create a spam filter and then write a report describing and evaluating your filter.
The implementation is worth 40% of the marks for this assignment, awarded for its quality and effectiveness: half for the code and half for the effectiveness of the filter. We will test your spam filter on a different corpus from the one used in the tutorials. You should not assume that, for instance, there will be any emails from Birmingham accounts.
The only restriction on your implementation is that it must contain a class called SpamFilter that instantiates your spam filter via a constructor with no arguments. In other words, the main class of your spam filter must be the class SpamFilter. In addition, SpamFilter must implement the Classifier interface given in the Naive Bayes code. This is to allow us to automate the evaluation of your spam filter.
The Classifier interface defines two methods of interest:
You may use any part of the code from the tutorials, but we will expect to see significant portions of your own code as well.
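The structural requirement on SpamFilter can be sketched as follows. This is only an illustration under stated assumptions: the real Classifier interface is the one supplied with the Naive Bayes tutorial code, and the stand-in interface here, including the method names train and isSpam, is hypothetical. EvaluationHarness is likewise an invented name showing one way an automated test could create the filter through its no-argument constructor.

```java
// Hypothetical harness: shows why a no-argument constructor is needed.
final class EvaluationHarness {
    // Creates the filter knowing nothing about it except that it has a
    // no-argument constructor and implements Classifier.
    static Classifier newFilter() {
        try {
            return SpamFilter.class.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("SpamFilter needs a no-argument constructor", e);
        }
    }

    public static void main(String[] args) {
        Classifier filter = newFilter();
        filter.train("Cheap pills, buy now", true);
        System.out.println(filter.isSpam("Minutes of Tuesday's meeting"));
    }
}

// HYPOTHETICAL stand-in. The real interface is in the Naive Bayes code,
// and its method names and signatures may differ from these.
interface Classifier {
    void train(String emailText, boolean isSpam); // assumed method
    boolean isSpam(String emailText);             // assumed method
}

// The main class must be called SpamFilter and implement Classifier.
// (In a submission it would normally be public, in SpamFilter.java.)
class SpamFilter implements Classifier {
    private int emailsSeen = 0;

    // Required: a constructor with no arguments.
    public SpamFilter() {
    }

    @Override
    public void train(String emailText, boolean isSpam) {
        // Placeholder: a real filter would update its model here.
        emailsSeen++;
    }

    @Override
    public boolean isSpam(String emailText) {
        // Placeholder decision only: never flags an email as spam.
        return false;
    }

    int emailsSeen() {
        return emailsSeen;
    }
}
```

Whatever the real interface looks like, the point of the sketch is that all configuration must happen inside the no-argument constructor, since the evaluation code cannot pass anything in.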
If your spam filter makes cost-sensitive classifications, you should assume that classifying a genuine email as spam has 1000 times the cost of classifying a spam email as genuine.
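One way to read this cost ratio, assuming your filter produces an estimate p of the probability that an email is spam: labelling the email as spam has expected cost 1000(1 - p), while labelling it as genuine has expected cost p, so the cheaper decision is "spam" only when p > 1000/1001, roughly 0.999. The sketch below implements that minimum-expected-cost rule; the class and method names are illustrative, and a filter that does not output probabilities would need a different approach.

```java
// Minimal sketch of a minimum-expected-cost decision rule.
final class CostSensitiveDecision {
    // Cost ratio from the assignment: blocking a genuine email is assumed
    // to cost 1000 times as much as letting a spam email through.
    static final double COST_RATIO = 1000.0;

    // Label as spam only if p > ratio / (1 + ratio).
    static double spamThreshold(double costRatio) {
        return costRatio / (1.0 + costRatio);
    }

    // pSpam is the filter's estimated probability that the email is spam.
    static boolean isSpam(double pSpam) {
        return pSpam > spamThreshold(COST_RATIO);
    }

    public static void main(String[] args) {
        System.out.println("threshold = " + spamThreshold(COST_RATIO));
        System.out.println("p = 0.9995 -> spam? " + isSpam(0.9995));
        System.out.println("p = 0.99   -> spam? " + isSpam(0.99));
    }
}
```

Under this rule an email that the filter believes is 99% likely to be spam is still delivered, which is the intended effect of such a high cost ratio.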
Your report will get the other 60% of the marks for this assignment. Your report should detail:
Your report should be 5 to 10 pages long and set in a font no smaller than 10 points. Only PDF or PostScript format is acceptable for the report. Reports in other formats, such as Word, RTF or HTML, will not be marked. You can convert a Word document to a PostScript file by printing it to a file from within OpenOffice, which is available on all Linux machines. The front page of your report must include a declaration of the relative contribution of each team member. Marks will be allocated based on this declaration.
You should use the research papers referenced in the tutorial worksheets as a guide to the quality of work to aim for. In particular, see how spam filters are assessed in the papers. It is important that you aim to give a similarly thorough assessment.
As this assignment is fairly open-ended, below are some example projects, in rough order from easiest to most difficult. You may, of course, do something completely different.
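As a starting point for such an assessment, the sketch below computes spam precision and spam recall from classification counts on a test set. These two measures are standard in text classification; whether they match the measures used in the referenced papers is an assumption, so check the papers themselves. The class and parameter names are illustrative.

```java
// Minimal sketch: two standard measures of spam filter effectiveness.
final class SpamMetrics {
    // Of the emails the filter flagged as spam, the fraction that really were spam.
    static double precision(int spamCaught, int genuineBlocked) {
        int flagged = spamCaught + genuineBlocked;
        return flagged == 0 ? 0.0 : (double) spamCaught / flagged;
    }

    // Of the spam emails in the test set, the fraction the filter flagged.
    static double recall(int spamCaught, int spamMissed) {
        int totalSpam = spamCaught + spamMissed;
        return totalSpam == 0 ? 0.0 : (double) spamCaught / totalSpam;
    }

    public static void main(String[] args) {
        // Made-up counts, purely to show the calls.
        int spamCaught = 90, spamMissed = 10, genuineBlocked = 1;
        System.out.println("spam precision = " + precision(spamCaught, genuineBlocked));
        System.out.println("spam recall    = " + recall(spamCaught, spamMissed));
    }
}
```

Neither measure on its own reflects the unequal costs of the two kinds of error, so a thorough report would report both and discuss genuine emails blocked separately.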
The corpus of spam and genuine emails used in the tutorials will continue to grow and remain available at
You are welcome to contribute to the corpus; doing so will help everyone.
The worksheets have contained references to a few of the many papers on spam filtering. An excellent resource is the list of papers at [Tansuwan(Accessed 2004)]. This webpage also links to an extensive bibliography on spam filtering. Google and CiteSeer are also useful for finding material on spam filtering. The writings of Paul Graham [Graham(2002), Graham(2003)] provide interesting insights into the practical aspects of spam filtering.