3rd/4th Year UG and MSc

Intelligent Data Analysis (Extended)

Course Material and Useful Links

Peter Tino


Lecture Timetable and Handouts

Here is a preliminary outline of the module structure and lecture timetable. I will develop most of the ideas on whiteboard. You are strongly encouraged to take notes during the lectures. Some knowldege of maths is assumed, but more complicated notions will always be explained during the lectures. However, your maths should be better than this.

Week Lecture 1 Lecture 2
1 Module Administration. Overview of taught topics. Vectorial data. Basic concepts - vector spaces and matrix algebra.
2 Basic probability theory and statistics. [notes] Dimensionality reduction of vectorial data. Linear models. Principal Component Analysis I. [slides] [notes]
3 Principal Component Analysis II. [slides] [notes] [cov matrix example] Principal Component Analysis III. [slides] [notes]
4 Applications of PCA. [slides] [demo] data Document mining. [slides]
5 Representing documents. [slides] tfidf and information theory [slides]
6 Latent semantic indexing. [slides] Clustering. [slides]
7 On-line clustering. [slides] Constrained clustering. [slides]
8 Classification I. [demo] Classification II
9 Learnability and generalization. Assessing generalization, model selection.
10 Information retrieval. PageRank algorithm I. [slides] PageRank algorithm II. [slides]
11 Rank of page communities. [slides] Important topics not covered in this module.

12 Revision Lecture Covering the Whole Module

Useful Links

Suggested reading + software

Unassessed Exercises

  • Variance/Co-Variance calculations
  • Principal Component Analysis

  • Assignment for those taking "Intelligent Data Analysis Extended" (DEADLINE 1st March)

    You can chose any data set(s) from the list bellow. Of course, finding your own dataset to investigate is much more prefarable!
    If you decide to go the easier route and use some of the data sets provided, un-tar the relevant file and go to the corresponding folder.
    The folder contains the data set, as well as additional information about the data. Read the available information, especially description of the features (data dimensions).
    You will need to clean the data, so that it contains only numerical features (dimensions) and the features are space-separated (not comma-separated.

    To make the plots informative, you should come up with a labelling scheme for data points.
    If the data can be classified into several classes (find out in the data and feature description!), use that information as the basis for your labelling scheme. In that case exclude the class information from the data dimensions.
    Alternatively, you can make labels out of any dimension, e.g. by quantising it into several intervals. For example, if the data dimension represents age of a person, you can quantise it into 5 labels (classes) [child, teenager, young adult, middle age, old].
    Associate the data labels with different markers and use the markers to show what kind of data points get projected to different regions of the visualization plot (computer screen).

    Before starting to work on the assignment, please carefully study the example I prepared using the boston database. Un-tar the file boston.ex.tar.gz and go to the folder "BOSTON.EX".
    The subfolder "FIGURES" contains all the relevant figures as eps or gif files.
    Please consult the "boston.read.me" file in BOSTON.EX.

    The report should describe experiments with a chosen data set(s) along the lines of `boston example'.
    In the labeling scheme, concentrate on more than one coordinate (dimension), e.g. in the `boston example', consider not just the `price' feature, but run separate experiments with `per capita crime rate in the town', or 'pupil-teacher ratio in the town' instead of the `price' coordinate).

    In the report concentrate on the following questions:
    - How did you preprocess the data? (worth 20%)
    - What features (coordinates) did you use for labeling the projected points with different markers? (worth 10%)
    - How did you design the labeling schemes? (worth 20%)
    - What interesting aspects of the data did you detect based on the data visualisations? (worth 30%)
    - What interesting aspects of the data did you detect based on eigenvector and eigenvalue analysis of the data covariance matrix? (worth 20%)

    You should demonstrate that you
    - understand the visualisation techniques used
    - are able to extract useful information about otherwise inconceivable high-dimensional data using dimensionality-reducing visualisation techniques.

    The overall mark m (in the range 0-100%) will be linearly scaled to the range 0-20% by 0.2*m.

    For examples of nice past reports developed on wine dataset (do not use this data in your report!), please see reports by Christoph Stich and Josephf Preece. Many thanks Chris and Joe!

    Data Visualisation using PCA and SOM

    Try a Java applet for an interactive PCA/SOM created by Daoxiao Jin.

    Source code (tar+gzip)

    Preparing for the exam - Sample Questions

    Problems to solve for those really interested

    Try the models out! - Benchmark data sets

    As you get familiar with different types of models, try them out on benchmark data sets people in the machine learning community have been using to support their claims about yet another excellent learning system :-)

    Here are two of the widely used data repisitories that contain data description, data itself and other useful things, like previously obtained results.

  • DELVE - Data for Evaluating Learning in Valid Experiments
  • UCI Knowledge Discovery in Databases Archive

    Also, try these test collections that are used to measure the effectiveness of information retrieval and text mining systems. Again, the data repisitories contain data description, data itself and other useful things, like previously obtained results.

  • Reuters-21578. The most widely used text categorization test collection.
  • RCV1 (Reuters Corpus Volume 1). Recently released collection of news stories.
  • TREC-AP. A text categorization task based on the Associated Press articles.
  • 20 Newsgroups. 20000 messages taken from 20 Usenet newsgroups.

    Aims, Objectives, and Assessment

    For formal details about the aims and objectives and assessment you should look at the official Module Syllabus Page [IDA] [IDA Extended].

    IDA students will be assessed by examination (100%).
    IDA Extended students will be assessed based on examination (80%) and a continuous assessment by mini-project report (20%).

    Recommended Books

    The Recommended Books for this module are:

    Title Author(s) Publisher, Date Comments
    The Elements of Statistical Learning: Data Mining, Inference, and Prediction T. Hastile, R. Tibshirani, J. Friedman Springer, 2001 Comprehensive and cover many state-of-the-art statistical learning techniques and very helpful to understand the essence of Data Mining. Highly recommended for mathematically minded students.
    Principles of Data Mining D.J. Hand, H. Mannila, P. Smyth MIT Press, 2003 A nice gentle introduction to many areas of Dat Mining.
    Data Mining: Introductory and Advanced Topics M. H. Dunham Prentice Hall, 2003 Less mathematical and brief introduction to basic Data Mining techniques.
    Neural Networks for Pattern Recognition Christopher Bishop Clarendon Press, Oxford, 1995 You may need some sections of this book, particularly those on linear techniques (such as PCA) and generalisation.

    This page is maintained by Peter Tino