# Intelligent Data Analysis (Extended)

## Peter Tino

### Lecture Timetable and Handouts

Here is a preliminary outline of the module structure and lecture timetable. I will develop most of the ideas on the whiteboard. You are strongly encouraged to take notes during the lectures. Some knowledge of maths is assumed, but more complicated notions will always be explained during the lectures.

| Week | Lecture 1 | Lecture 2 |
|------|-----------|-----------|
| 1 | Module Administration. Overview of taught topics. | Vectorial data. Basic concepts: vector spaces and matrix algebra. |
| 2 | Basic probability theory and statistics. [notes] | Dimensionality reduction of vectorial data. Linear models. Principal Component Analysis I. [slides] [notes] |
| 3 | Principal Component Analysis II. [slides] [notes] [cov matrix example] | Principal Component Analysis III. [slides] [notes] |
| 4 | Applications of PCA. [slides] [demo] [data] | Document mining. [slides] |
| 5 | Representing documents. [slides] | tf-idf and information theory. [slides] |
| 6 | Latent semantic indexing. [slides] | Clustering. [slides] |
| 7 | On-line clustering. [slides] | Constrained clustering. [slides] |
| 8 | Classification I. [demo] | Classification II. |
| 9 | Learnability and generalization. | Assessing generalization, model selection. |
| 10 | Information retrieval. PageRank algorithm I. [slides] | PageRank algorithm II. [slides] |
| 11 | Rank of page communities. [slides] | Important topics not covered in this module. |
| 12 | Revision lecture covering the whole module. | |

### Unassessed Exercises

• Variance/Co-Variance calculations
- Un-tar the file CovEx.tar and go to the folder "CovEx". The folder contains seven 2-dimensional datasets dataN.dat, N=1,2,...,7, along with pictures of their distributions (dataN.jpg, N=1,2,...,7).
- For each dataset, estimate the variances Var[X] and Var[Y] of the data along the two coordinates x and y, as well as the covariance Cov[X,Y].
- Report the calculated estimates, along with an explanation of the differences among the estimated values for the 7 datasets.

• Principal Component Analysis
- Do the assignment specified below for those taking "IDA Extended".
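A minimal sketch of the estimates involved in the variance/covariance exercise (the 4-point array below is an invented stand-in for one of the dataN.dat files; the unbiased 1/(n-1) convention is assumed):

```python
import numpy as np

# Stand-in for one of the dataN.dat files: rows are (x, y) points.
data = np.array([[1.0, 2.0],
                 [2.0, 4.1],
                 [3.0, 5.9],
                 [4.0, 8.2]])

x, y = data[:, 0], data[:, 1]
n = len(data)

# Unbiased estimators: sum of squared deviations, divided by (n - 1).
var_x = np.sum((x - x.mean()) ** 2) / (n - 1)
var_y = np.sum((y - y.mean()) ** 2) / (n - 1)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the full 2x2 covariance matrix with the same convention.
C = np.cov(data.T)
```

For a real dataN.dat file the array would be read with, e.g., `np.loadtxt("data1.dat")`.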

### Assignment for those taking "Intelligent Data Analysis Extended" (DEADLINE 1st March)

You can choose any data set(s) from the list below. Of course, finding your own dataset to investigate is much preferable!
If you decide to go the easier route and use some of the data sets provided, un-tar the relevant file and go to the corresponding folder.
The folder contains the data set, as well as additional information about the data. Read the available information, especially description of the features (data dimensions).
You will need to clean the data so that it contains only numerical features (dimensions) and the features are space-separated (not comma-separated).

To make the plots informative, you should come up with a labelling scheme for data points.
If the data can be classified into several classes (check the data and feature description!), use that information as the basis for your labelling scheme. In that case, exclude the class information from the data dimensions.
Alternatively, you can make labels out of any dimension, e.g. by quantising it into several intervals. For example, if a data dimension represents the age of a person, you can quantise it into 5 labels (classes): [child, teenager, young adult, middle age, old].
Associate the data labels with different markers and use the markers to show what kind of data points get projected to different regions of the visualization plot (computer screen).
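For concreteness, a minimal sketch of the quantise-then-project idea described above (the toy data, the choice of labelling column, and the quartile thresholds are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for a cleaned, numerical data set: 200 points in 5 dimensions.
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# Suppose column 0 plays the role of `age': quantise it into 4 labels.
bins = np.quantile(X[:, 0], [0.25, 0.5, 0.75])
labels = np.digitize(X[:, 0], bins)          # integer classes 0..3

# Exclude the labelling dimension from the data used for PCA.
Xd = X[:, 1:]
Xc = Xd - Xd.mean(axis=0)                    # centre the data

# Eigen-decomposition of the sample covariance matrix.
C = np.cov(Xc.T)
eigval, eigvec = np.linalg.eigh(C)           # ascending eigenvalue order
order = np.argsort(eigval)[::-1]             # descending

# Project onto the two leading principal components.
P = Xc @ eigvec[:, order[:2]]                # shape (200, 2)

# P[:, 0], P[:, 1] can now be scattered with one marker per label, e.g.
# with matplotlib: plt.scatter(P[labels == k, 0], P[labels == k, 1]).
```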

• Learn as much as you can about the chosen data set(s) using the visualization method developed in the module - PCA.
• Use various data labelling schemes.
• Compare PCA with straightforward co-ordinate projections.
• Data sets:
Before starting to work on the assignment, please carefully study the example I prepared using the boston dataset. Un-tar the file boston.ex.tar.gz and go to the folder "BOSTON.EX".
The subfolder "FIGURES" contains all the relevant figures as eps or gif files.

The report should describe experiments with the chosen data set(s) along the lines of the `boston example'.
In the labelling scheme, concentrate on more than one coordinate (dimension); e.g. in the `boston example', consider not just the `price' feature, but run separate experiments with `per capita crime rate in the town' or `pupil-teacher ratio in the town' instead of the `price' coordinate.

In the report concentrate on the following questions:
- How did you preprocess the data? (worth 20%)
- What features (coordinates) did you use for labeling the projected points with different markers? (worth 10%)
- How did you design the labeling schemes? (worth 20%)
- What interesting aspects of the data did you detect based on the data visualisations? (worth 30%)
- What interesting aspects of the data did you detect based on eigenvector and eigenvalue analysis of the data covariance matrix? (worth 20%)

You should demonstrate that you
- understand the visualisation techniques used
- are able to extract useful information about otherwise incomprehensible high-dimensional data using dimensionality-reducing visualisation techniques.

The overall mark m (in the range 0-100%) will be linearly scaled to the range 0-20% by 0.2*m.

For examples of nice past reports developed on the wine dataset (do not use this data in your report!), please see the reports by Christoph Stich and Joseph Preece. Many thanks to Chris and Joe!

### Data Visualisation using PCA and SOM

Try a Java applet for an interactive PCA/SOM created by Daoxiao Jin.

Source code (tar+gzip)

### Preparing for the exam - Sample Questions

• Principal Component Analysis (PCA)
Handouts: Principal Component Analysis
- Why is it good to concentrate on the variance/covariance structure of multidimensional data?
- How can the variance/covariance structure be quantified? Define variance of a random variable and covariance of two random variables.
- How can the variance of a random variable X be estimated from a finite sample of realizations of X?
- Consider two random variables X,Y that form a two-dimensional vector random variable Z=(X,Y). How can the covariance between X and Y be estimated given a finite sample of realizations of Z?
- What is the covariance matrix of a vector random variable?
- Define eigenvectors and eigenvalues of a square matrix A.
- Why are the eigenvectors/eigenvalues of covariance matrix important?
- Write down the PCA algorithm.
- How can you quantify the amount of lost information when projecting points onto a low-dimensional plane via PCA?
- What are the advantages/drawbacks of PCA?
• Data Clustering and Classification
Handouts: Topographic Maps of Vectorial Data
You need only slides 5-9.
- What constitutes a classification problem?
- Explain the notion of model complexity and overfitting (overtraining) using the K-NN classifier.
- What is data clustering?
- Explain the on-line clustering algorithm.
- What are the drawbacks of such algorithms?
• Mining Textual Data
Handouts: Mining Textual Data
- What are the specifics of textual data that make data mining challenging?
- What are the options for representing textual data items (e.g. text documents)?
- Do we need a detailed grammar model when mining textual data? Why?
- What are the advantages/disadvantages of using a vector space representation of documents?
- Explain the main ideas behind Latent Semantic Analysis (LSA).
- Write down the LSA algorithm.
- In what situations would you expect LSA to work well? Why?
- What are the drawbacks of LSA?
- Describe options for building a classifier for textual data.
- How can we uncover hidden structure among the terms (as opposed to documents)?
- Describe options for document retrieval using LSA.
• Searching the Web
Handouts: PageRank
- Why do we need a good ranking algorithm in document retrieval?
- What are the main ideas behind the PageRank algorithm?
- How can the standard PageRank algorithm be fooled (web spamming)?
- Why do we need to use an iterative approach to solve PageRank equations?
- Why do iterations of PageRank equations converge?
- Define the notion of energy of a web community.
- For a given web community, how can its energy be decomposed with respect to internal and external links to/from/within the community?
- What are dangling pages?
- Why should dangling pages be avoided?
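As a sketch of the iterative solution asked about in the PageRank questions (the 4-page link structure and the damping factor 0.85 are illustrative assumptions; this toy web deliberately has no dangling pages):

```python
import numpy as np

# Toy web of 4 pages; links[i] lists the pages that page i links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n, d = 4, 0.85

# Column-stochastic transition matrix: column i spreads page i's rank
# uniformly over its out-links.
A = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        A[j, i] = 1.0 / len(outs)

# Power iteration on the PageRank equations, starting from uniform rank.
r = np.full(n, 1.0 / n)
for _ in range(100):
    r = (1 - d) / n + d * (A @ r)

# r now approximates the fixed point of the PageRank equations;
# page 2, with the most in-links, ends up ranked highest.
```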

### Problems to solve for those really interested

• Principal Component Analysis
- Show how the need for eigen-decomposition of the data covariance matrix arises from minimizing the distance between the data points and their corresponding projections in a low-dimensional linear subspace.
- What happens if there are two (or more) equal eigenvalues of the covariance matrix?
- In general, eigenvalues can be complex numbers. Why are the eigenvalues of the covariance matrix always real numbers?
• Noah Gresham's talk
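Not an answer to the last question, but a quick numerical sanity check of the claim (the toy data here is invented for the sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data set: 100 points in 4 dimensions.
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
C = np.cov(X.T)

# The sample covariance matrix is symmetric by construction.
symmetric = np.allclose(C, C.T)

# np.linalg.eig can return complex eigenvalues for a general square
# matrix, yet for C the imaginary parts vanish (up to numerical noise).
eigvals = np.linalg.eig(C)[0]
max_imag = np.max(np.abs(eigvals.imag))
```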

### Try the models out! - Benchmark data sets

As you get familiar with different types of models, try them out on benchmark data sets people in the machine learning community have been using to support their claims about yet another excellent learning system :-)

Here are two widely used data repositories that contain data descriptions, the data itself, and other useful material, such as previously obtained results.

• DELVE - Data for Evaluating Learning in Valid Experiments
• UCI Knowledge Discovery in Databases Archive

Also, try these test collections, which are used to measure the effectiveness of information retrieval and text mining systems. Again, these repositories contain data descriptions, the data itself, and other useful material, such as previously obtained results.

• Reuters-21578. The most widely used text categorization test collection.
• RCV1 (Reuters Corpus Volume 1). Recently released collection of news stories.
• TREC-AP. A text categorization task based on the Associated Press articles.
• 20 Newsgroups. 20000 messages taken from 20 Usenet newsgroups.

### Aims, Objectives, and Assessment

For formal details about the aims and objectives and assessment you should look at the official Module Syllabus Page [IDA] [IDA Extended].

IDA students will be assessed by examination (100%).
IDA Extended students will be assessed based on examination (80%) and a continuous assessment by mini-project report (20%).

### Recommended Books

The Recommended Books for this module are:

• *The Elements of Statistical Learning: Data Mining, Inference, and Prediction*, T. Hastie, R. Tibshirani, J. Friedman, Springer, 2001. Comprehensive; covers many state-of-the-art statistical learning techniques and is very helpful for understanding the essence of Data Mining. Highly recommended for mathematically minded students.
• *Principles of Data Mining*, D.J. Hand, H. Mannila, P. Smyth, MIT Press, 2003. A nice, gentle introduction to many areas of Data Mining.
• *Data Mining: Introductory and Advanced Topics*, M.H. Dunham, Prentice Hall, 2003. A less mathematical, brief introduction to basic Data Mining techniques.
• *Neural Networks for Pattern Recognition*, Christopher Bishop, Clarendon Press, Oxford, 1995. You may need some sections of this book, particularly those on linear techniques (such as PCA) and generalisation.