Assignment Introduction This project accounts for 30% of your mark for the Introduction to Artificial Intelligence module. It is to be done in the group you formed at the beginning of term. This project is due (code and report) in the final week of term, Thursday, 9 December 2004, 12:00 noon. Late submissions will be penalised 5 marks for any 24 hours or part thereof they are late. Submissions received after Friday, 17 December 2004, 12:00 noon, will not be accepted and receive a zero mark. Code and a PDF or Postscript version of your report must be submitted using the ai-intro-submit command. This command works in exactly the same way as the fyw-submit command used in the first year Java course. Ensure you keep the email confirming submission. Task The aim of this assignment is: to demonstrate your understanding of formalisms and algorithms used in artificial intelligence; to apply them to the example problem of email spam filtering; and to demonstrate your ability to conduct a scientific study into an aspect of artificial intelligence. To complete the assignment you must create a spam filter and then write a report describing and evaluating your filter. You will get 40% of the marks for this assignment based on the quality and effectiveness of your implementation, of which half will be allocated for the code and half for the effectiveness of the filter. We will test your spam filter on a different corpus to the one used in the tutorials. You should not assume that, for instance, there will be any emails from Birmingham accounts. The only restriction on your implementation is that it must contain a class called SpamFilter that instantiates your spam filter via a constructor with no arguments. In other words, the main class of your spam filter must be the class SpamFilter. In addition SpamFilter must implement the Classifier interface given in the Naive Bayes code. This is to allow us to automate the evaluation of your spam filter. The Classifier interface defines two methods of interest: * The train method will be called when your spam filter is instantiated and is the chance for your spam filter to train on a pre-classified corpus of spam and genuine emails. * The classify method will be called to classify an email as spam or genuine. Your spam filter should return true if it classifies the message as spam, and false otherwise. You may use any part of the code from the tutorials but we will expect to see significant portions of your own code as well. If your spam filter makes cost-sensitive classifications you should assume that classifying a genuine email as spam has 1000 times the cost of classifying a spam email as genuine. Your report will get the other 60% of the marks for this assignment. Your report should detail: * The ideas behind your spam filter, and a justification for these ideas * A discussion of the implementation, concentrating on the main algorithm you used. * An evaluation of your implementation * An assessment of your algorithm, in particular where and why it succeeds and fails, how it is better than other alternatives you considered, and how it could be improved. This is the most important part of the write-up. Where possible present experimental evidence or detailed analysis to back up your claims. * A reference list containing any sources you used when completing this assignment. This document contains a reference list should can use as a model. If you do not reference your sources you may be found guilty of plagarism and in extreme cases expelled from the your course! Your report should be 5 to 10 pages long and set in a font no smaller than 10 points. Only PDF or Postscript format is acceptable for the report. Reports using other formats, such as Word, RTF or HTML, will not be marked. You can convert a Word document to a Postscript file by printing it to a file from within OpenOffice, available on all Linux machines. The front page of your report must include a declaration of the relative contribution of your team members. Marks will be allocated based on this declaration. You should use the research papers referenced in the tutorial worksheets as a guide for the quality of work to aim for. In particular see how spam filters are assessed in the papers. It is important that you aim to give a similarly thorough assessment. Example Projects As this assignment is fairly open ended, below are some example projects in rough order from easiest to most difficult. You may, of course, do something completely different. * Implement the extensions to the Naive Bayes algorithm described in [Kohavi et al.(1997)] and show if performance on the basic algorithm is improved by doing so. * Alter the ID3 algorithm to make cost-sensitive decisions and to use the bag-of-words representation. * Integrate rules and the bag-of-words representation into a classifier. You could use hand-coded rules, or better yet, learn rules using a technique such as a genetic algorithm. * Implement a boosting classifier as described in [Carreras and Marquez(2001)] and [Elkan(1997)]. Resources The corpus of spam and genuine emails used in the tutorials will continue to grow and remain available at /bham/common/courses/ai-intro/corpus/ You are welcome to contribute to the corpus - doing so will help everyone. The worksheets have contained references to a few of the many papers on spam filtering. An excellent resource is the list of papers at [Tansuwan(Accessed 2004)]. This webpage also links to an extensive bibliography on spam filtering. Google and Citeseer are excellent resources for finding material on spam filtering. The writings of Paul Graham[Graham(2002), Graham(2003)] provide interesting insights into the practical aspects of spam filtering. * Carreras and Marquez(2001) Xavier Carreras and Lluis Marquez. Boosting trees for anti-spam email filtering. In \em Proceedings of RANLP-2001, 4th International Conference on Recent Advances in Natural Language Processing, 2001. URL [1]http://www.lsi.upc.es/~carreras/pub/boospam.ps. * Elkan(1997) Charles Elkan. Boosing and naive bayesian learning. Technical Report CS97-557, UCSD, 1997. URL [2]http://www-cse.ucsd.edu/users/elkan/papers/bnb.ps. * Graham(2002) Paul Graham. A plan for spam. WWW Page, 2002. URL [3]http://www.paulgraham.com/spam.html. * Graham(2003) Paul Graham. Better bayesian filtering. WWW Page, 2003. URL [4]http://www.paulgraham.com/better.html. * Kohavi et al.(1997) R. Kohavi, B. Becker, and D. Sommerfield. Improving simple bayes. In \em Proceedings of the European Conference on Machine Learning, 1997. URL [5]http://robotics.stanford.edu/%7Eronnyk/impSBC.ps.Z. * Tansuwan(Accessed 2004) Justin Jesada Tansuwan. Spam classification papers. WWW Page, Accessed 2004. URL [6]http://www.stanford.edu/~jjtswan/research/spam/spam.html. _________________________________________________________________________________________________________ © Noel Welsh & Manfred Kerber, 2004, Introduction to AI 3.11.2004 The URL of this page is http://www.cs.bham.ac.uk/~mmk/Teaching/AI/eass.html. URL of module [7]http://www.cs.bham.ac.uk/~mmk/Teaching/AI/ References 1. http://www.lsi.upc.es/~carreras/pub/boospam.ps 2. http://www-cse.ucsd.edu/users/elkan/papers/bnb.ps 3. http://www.paulgraham.com/spam.html 4. http://www.paulgraham.com/better.html 5. http://robotics.stanford.edu/%7Eronnyk/impSBC.ps.Z 6. http://www.stanford.edu/~jjtswan/research/spam/spam.html 7. http://www.cs.bham.ac.uk/~mmk/Teaching/AI/