Introduction to AI - Worksheet 2

The assessable task for this course is to develop an intelligent spam filter. In this exercise we are going to explore our first spam filtering technique: the expert system. The basic idea is quite simple: we write a number of rules that classify emails as spam or genuine. Complications arise when we try to write rules that match only spam not genuine email, and when we must combine several rules that give differing classifications. This is the essence of the approach taken by the SpamAssassin filter. I recommend you read a bit about SpamAssassin to get some ideas for the rest of the exercise.

We have prepared two corpi of emails: one of genuine emails, and one of spam. At the moment the corpi are quite small. Later we will want bigger corpi. Gathering spam emails is easy (there are several repositories online) but gathering genuine email is difficult. Hence we request that you contribute any genuine emails you are prepared to see in the corpus. If you have emails to contribute please forward them to nhw@cs.bham.ac.uk with subject line Email for Corpus.

The corpi can be obtained from the course webpage. You should download and extract them in your local directory. When the corpi get larger we will install them in a globally accessible part of the School's filesystem to save your quota.

We have also prepared a software framework for this task. The framework can also be obtained from the course webpage, and you should download and extract it. The code all runs in the JVM, but it is not all Java. The interesting parts of the code are written in the SISC implementation of Scheme, and we make use of the Schelog logic programming library, which happens to run in Scheme. You don't need to learn to Scheme to complete the exercise but if you are interested in learning new ideas (and Scheme and Schelog will expose you to many) there are many good resources online. Some of the better ones are:

  1. Teach Yourself Scheme
    (http://www.ccs.neu.edu/home/dorai/t-y-scheme/t-y-scheme.html) A quick introduction to Scheme if you are familiar with another programming language.
  2. SISC Home (http://sisc.sourceforge.net/) The SISC homepage has downloadable and online documentation for the SISC implementation of Scheme. This won't teach you how to use Scheme but will tell you how to use the features that are unique to SISC, such as its integration with Java.
  3. Schemers.org (http://www.schemers.org/) Links to many more Scheme resources the majority of which are available online. In particular see How to Design Programs (HtDP), and Structure and Interpretation of Computer Programs (SICP). These are two of the best books on programming and both use Scheme (though the focus is not on Scheme, but rather fundamental concepts). HtDP is a gentler journey; SICP covers more ground.
  4. Schelog Home
    (http://www.ccs.neu.edu/home/dorai/schelog/schelog.html) Documents the Schelog logic programming system that runs within Scheme.

If Scheme doesn't interest you, you won't need to learn much to do this tutorial.

EXERCISE: 1

Your first task is to get the expert system spam filter up and running. To do so, change to the directory where you installed the software and run SISC:

  > sisc
  SISC (1.8.8) - main
  #;> 
Now load the file spam-filter.scm, which runs the expert system spam filter. You should see output similar to that below:

  #;> (load "spam-filter.scm")
  Classifying spam
  ==================
  spam 1: Classified by (fake-url)
  spam 2: Not classifed
  spam 3: Classified by (africa)
  ...
  spam classification done
  
  Classifying email
  ==================
  ...

If you get to this step, congratulations, everything works and we're now ready to start extending the expert system spam filter.

EXERCISE: 2

Your next task is to add a new rule to the system to classify more emails as spam. Inspect a spam email that is currently not classified. In my output above spam 2 is not classified, so I'll inspect it:

  #;> (show-spam 2)
  #<java java.lang.String 
  <html>
  </head>
  <body>
  <p align="center">
  <a href="http://jotha.expertfreestuff.com/dell2/?s=quo0a7879...
Note that the show-spam function only shows the body (and only the first body of multipart emails). If you want to get fancy you can look at the headers and other parts of multipart emails. Read the Javamail API documentation for this. Now think of a way you could classify this email as spam. Try you think of a way to classify it that will catch as much similar spam as possible without incorrectly classifing any genuine email. Write a rule to do this. There are two example rules you can base your code off: AfricaRule.java and fake-url-rule.scm, written in Java and Scheme respectively.

If you write a rule in Java it must implement the Rule interface. It will automatically be added to the list of rules once you have compiled it. If you write a rule in Scheme you must call add-rule! to add it to the list of available rules. Once you added your rule, reload spam-filter.scm and see what effect your rule has. Make sure no genuine emails are incorrectly classified!

EXERCISE: 3

With the small corpi we are using it is easy to write rules that correctly classify all emails. However it will not be so easy to do this when the corpi grow. Often a rule will classify some spam correctly, but also classify a few genuine emails as spam. We must do some form of conflict resolution to get around this problem. One simple form of conflict resolution is to base the classification on combinations of rules. For example, "Nigerian scam" emails often talk about airlines, Africa, and bank accounts. However a genuine email about, say, a holiday, could discuss airlines. Therefore we might get better performance if we only classify an email as spam if all of airlines, Africa and bank accounts are present. Implement this. You will need two steps:

  1. Implement the basic rules to identify airlines, Africa and bank accounts.
  2. Modify the email is classified as spam if all three rules match, but not if any one matches individually. You will need to read the Schelog documentation to do this.

If you have time implement some more rules. Extend the spam filter as far as your imagination and free time permit.


© Noel Welsh & Manfred Kerber, 2004, Introduction to AI
5.10.2004
The URL of this page is http://www.cs.bham.ac.uk/~mmk/Teaching/AI/e2.html.
URL of module http://www.cs.bham.ac.uk/~mmk/Teaching/AI/