Introduction to AI - Worksheet 2 The assessable task for this course is to develop an intelligent spam filter. In this exercise we are going to explore our first spam filtering technique: the expert system. The basic idea is quite simple: we write a number of rules that classify emails as spam or genuine. Complications arise when we try to write rules that match only spam not genuine email, and when we must combine several rules that give differing classifications. This is the essence of the approach taken by the SpamAssassin filter. I recommend you read a bit about SpamAssassin to get some ideas for the rest of the exercise. We have prepared two corpi of emails: one of genuine emails, and one of spam. At the moment the corpi are quite small. Later we will want bigger corpi. Gathering spam emails is easy (there are several repositories online) but gathering genuine email is difficult. Hence we request that you contribute any genuine emails you are prepared to see in the corpus. If you have emails to contribute please forward them to nhw@cs.bham.ac.uk with subject line Email for Corpus. The corpi can be obtained from the course webpage. You should download and extract them in your local directory. When the corpi get larger we will install them in a globally accessible part of the School's filesystem to save your quota. We have also prepared a software framework for this task. The framework can also be obtained from the course webpage, and you should download and extract it. The code all runs in the JVM, but it is not all Java. The interesting parts of the code are written in the SISC implementation of Scheme, and we make use of the Schelog logic programming library, which happens to run in Scheme. You don't need to learn to Scheme to complete the exercise but if you are interested in learning new ideas (and Scheme and Schelog will expose you to many) there are many good resources online. Some of the better ones are: 1. Teach Yourself Scheme ([1]http://www.ccs.neu.edu/home/dorai/t-y-scheme/t-y-scheme.html) A quick introduction to Scheme if you are familiar with another programming language. 2. SISC Home ([2]http://sisc.sourceforge.net/) The SISC homepage has downloadable and online documentation for the SISC implementation of Scheme. This won't teach you how to use Scheme but will tell you how to use the features that are unique to SISC, such as its integration with Java. 3. Schemers.org ([3]http://www.schemers.org/) Links to many more Scheme resources the majority of which are available online. In particular see How to Design Programs (HtDP), and Structure and Interpretation of Computer Programs (SICP). These are two of the best books on programming and both use Scheme (though the focus is not on Scheme, but rather fundamental concepts). HtDP is a gentler journey; SICP covers more ground. 4. Schelog Home ([4]http://www.ccs.neu.edu/home/dorai/schelog/schelog.html) Documents the Schelog logic programming system that runs within Scheme. If Scheme doesn't interest you, you won't need to learn much to do this tutorial. EXERCISE: 1 Your first task is to get the expert system spam filter up and running. To do so, change to the directory where you installed the software and run SISC: > sisc SISC (1.8.8) - main #;> Now load the file spam-filter.scm, which runs the expert system spam filter. You should see output similar to that below: #;> (load "spam-filter.scm") Classifying spam ================== spam 1: Classified by (fake-url) spam 2: Not classifed spam 3: Classified by (africa) ... spam classification done Classifying email ================== ... If you get to this step, congratulations, everything works and we're now ready to start extending the expert system spam filter. EXERCISE: 2 Your next task is to add a new rule to the system to classify more emails as spam. Inspect a spam email that is currently not classified. In my output above spam 2 is not classified, so I'll inspect it: #;> (show-spam 2) #