The assessable task for this course is to develop an intelligent spam filter. In this exercise we are going to explore our first spam filtering technique: the expert system. The basic idea is quite simple: we write a number of rules that classify emails as spam or genuine. Complications arise when we try to write rules that match spam but never genuine email, and when we must combine several rules that give differing classifications. This is the essence of the approach taken by the SpamAssassin filter. I recommend you read a bit about SpamAssassin to get some ideas for the rest of the exercise.
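One common way to combine rules that disagree is to give each rule a numeric weight and compare the sum against a threshold, which is broadly how SpamAssassin scores a message. The following is a minimal sketch of that idea only. The class name ScoreCombiner, its methods, and the weights used with it are hypothetical and are not part of the course framework.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: none of these names come from the course framework.
class ScoreCombiner {
    // A rule is a test on the email body plus a weight.
    // Positive weights suggest spam, negative weights suggest genuine email.
    private static final class WeightedRule {
        final Predicate<String> test;
        final double weight;

        WeightedRule(Predicate<String> test, double weight) {
            this.test = test;
            this.weight = weight;
        }
    }

    private final List<WeightedRule> rules = new ArrayList<>();
    private final double threshold;

    ScoreCombiner(double threshold) {
        this.threshold = threshold;
    }

    void addRule(Predicate<String> test, double weight) {
        rules.add(new WeightedRule(test, weight));
    }

    // Sum the weights of every rule that matches the body.
    double score(String body) {
        double total = 0.0;
        for (WeightedRule rule : rules) {
            if (rule.test.test(body)) {
                total += rule.weight;
            }
        }
        return total;
    }

    // The email is classified as spam when the total reaches the threshold.
    boolean isSpam(String body) {
        return score(body) >= threshold;
    }
}
```

With this design a single weak rule cannot condemn an email on its own, and a rule that recognises genuine email can outvote a rule that fires by mistake.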
We have prepared two corpora of emails: one of genuine emails, and one of spam. At the moment the corpora are quite small. Later we will want bigger corpora. Gathering spam emails is easy (there are several repositories online) but gathering genuine email is difficult. Hence we request that you contribute any genuine emails you are prepared to see in the corpus. If you have emails to contribute, please forward them to nhw@cs.bham.ac.uk with the subject line "Email for Corpus".
The corpora can be obtained from the course webpage. You should download and extract them in your local directory. When the corpora get larger we will install them in a globally accessible part of the School's filesystem to save your quota.
We have also prepared a software framework for this task. The framework can also be obtained from the course webpage, and you should download and extract it. The code all runs in the JVM, but it is not all Java. The interesting parts of the code are written in the SISC implementation of Scheme, and we make use of the Schelog logic programming library, which happens to run in Scheme. You don't need to learn Scheme to complete the exercise, but if you are interested in learning new ideas (and Scheme and Schelog will expose you to many) there are many good resources online. Some of the better ones are:
> sisc
SISC (1.8.8) - main
#;>

Now load the file spam-filter.scm, which runs the expert system spam filter. You should see output similar to that below:
#;> (load "spam-filter.scm")
Classifying spam
==================
spam 1: Classified by (fake-url)
spam 2: Not classifed
spam 3: Classified by (africa)
...
spam classification done
Classifying email
==================
...
If you get to this step, congratulations, everything works and we're now ready to start extending the expert system spam filter.
#;> (show-spam 2)
#<java java.lang.String <html> </head> <body> <p align="center"> <a href="http://jotha.expertfreestuff.com/dell2/?s=quo0a7879...

Note that the show-spam function only shows the body (and only the first body of multipart emails). If you want to get fancy you can look at the headers and other parts of multipart emails; read the JavaMail API documentation for this. Now think of a way you could classify this email as spam. Try to think of a way to classify it that will catch as much similar spam as possible without incorrectly classifying any genuine email. Write a rule to do this. There are two example rules you can base your code on: AfricaRule.java and fake-url-rule.scm, written in Java and Scheme respectively.
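As one possible starting point, a rule for an email like this might inspect the hosts that its links point to. The sketch below works on a plain body string and deliberately does not implement the framework's Rule interface, whose methods are not shown here. The class LinkHostCheck, its methods, and the list of suspicious fragments are all hypothetical, and a real rule would need checking against the genuine corpus.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: not the framework's Rule interface.
class LinkHostCheck {
    // Captures the host part of href="http://host/..." or href='https://host/...'.
    private static final Pattern HREF_HOST = Pattern.compile(
        "href\\s*=\\s*[\"']?https?://([^/\"'\\s>?#]+)",
        Pattern.CASE_INSENSITIVE);

    // Return every link host found in the body, lower-cased.
    static List<String> linkHosts(String body) {
        List<String> hosts = new ArrayList<>();
        Matcher m = HREF_HOST.matcher(body);
        while (m.find()) {
            hosts.add(m.group(1).toLowerCase(Locale.ROOT));
        }
        return hosts;
    }

    // True when any link host contains one of the given fragments.
    // The fragment list is an assumption that you would tune yourself.
    static boolean matches(String body, List<String> suspiciousFragments) {
        for (String host : linkHosts(body)) {
            for (String fragment : suspiciousFragments) {
                if (host.contains(fragment)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Matching on the host rather than the whole URL means the rule still fires when the path or query string changes between mailings, which is one way to catch similar spam without writing a rule per message.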
If you write a rule in Java it must implement the Rule interface. It will automatically be added to the list of rules once you have compiled it. If you write a rule in Scheme you must call add-rule! to add it to the list of available rules. Once you have added your rule, reload spam-filter.scm and see what effect it has. Make sure no genuine emails are incorrectly classified!
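To make the check for misclassified genuine email concrete, the sketch below counts how many spam bodies a rule catches and how many genuine bodies it wrongly flags. It is an illustration under stated assumptions: the rule is modelled as a plain predicate on the body text rather than the framework's Rule interface, and the class RuleEvaluation and its members are hypothetical names.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: a rule is modelled as a predicate on the body text.
class RuleEvaluation {
    final int spamCaught;      // spam bodies the rule matched
    final int falsePositives;  // genuine bodies the rule matched by mistake

    private RuleEvaluation(int spamCaught, int falsePositives) {
        this.spamCaught = spamCaught;
        this.falsePositives = falsePositives;
    }

    // Run the rule over both corpora and count the matches in each.
    static RuleEvaluation evaluate(Predicate<String> rule,
                                   List<String> spamBodies,
                                   List<String> genuineBodies) {
        int caught = 0;
        for (String body : spamBodies) {
            if (rule.test(body)) {
                caught++;
            }
        }
        int wrong = 0;
        for (String body : genuineBodies) {
            if (rule.test(body)) {
                wrong++;
            }
        }
        return new RuleEvaluation(caught, wrong);
    }

    // A rule is only acceptable here if it flags no genuine email at all.
    boolean isSafe() {
        return falsePositives == 0;
    }
}
```

Tracking the two counts separately matters because a rule that catches a lot of spam is still unacceptable if it flags even one genuine email.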