3rd/4th Year UG and MSc
Intelligent Data Analysis (Extended)
Course Material and Useful Links
P.Tino@cs.bham.ac.uk
Lecture Timetable and Handouts
Here is a preliminary outline of the module structure and lecture timetable.
I will develop most of the ideas on the blackboard.
You are encouraged to take notes during the lectures.
Any handouts used
will be made
available here as pdf files shortly after the paper versions have been
distributed. Some knowledge of maths is assumed, but more
complicated notions will always be explained during the lectures.
However, your maths should be better than
this.
| Week | Lecture 1 | Lecture 2 |
| 1 | Module Administration. Overview of taught topics. | Vectorial data. Basic concepts - vector spaces and matrix algebra. |
| 2 | Basic probability theory and statistics. | Dimensionality reduction of vectorial data. Linear models. Principal Component Analysis I. [PDF] |
| 3 | Principal Component Analysis II. [PDF] | Principal Component Analysis III. [PDF] |
| 4 | Applications of PCA. [PDF] [DEMO] [data] | Document mining. [PDF] |
| 5 | Representing documents. [PDF] | tf-idf and information theory. [PDF] |
| 6 | Latent semantic indexing. [PDF] | Clustering. [PDF] |
| 7 | On-line clustering. [PDF] | Constrained clustering. [PDF] |
| 8 | Classification I. [DEMO] | Classification II. |
| 9 | Learnability and generalization. | Assessing generalization, model selection. |
| 10 | Information retrieval. PageRank algorithm I. [PDF] | PageRank algorithm II. [PDF] |
| 11 | Rank of page communities. [PDF] | Important topics not covered in this module. |
| 12 | Two Revision Lectures Covering the Whole Module | |
Useful Links
Suggested reading + software
- Introductory probability theory and statistics
- Principal Component Analysis
  - Bishop: Section 8.6, Appendix E
  - Hand, Mannila and Smyth: Sections 3.6 and 6.5.2
  - Hastie, Tibshirani and Friedman: (part of) Section 3.4.3
  - tutorial by Lindsay Smith: a soft introduction to PCA with elementary vector and matrix algebra
- Clustering/Vector Quantization
- Classification
- Document mining, Latent Semantic Analysis (LSA)
  - Web page devoted to LSI
  - Latent Semantic Analysis on Wikipedia
  - S. Deerwester et al.: Indexing by latent semantic analysis. Journal of the American Society for Information Science, 6(41), pp. 391-407. 1990.
  - F.R. Lopez, H. Jimenez-Salazar, D. Pinto: A Competitive Term Selection Method for Information Retrieval. Computational Linguistics and Intelligent Text Processing, Lecture Notes in Computer Science Vol 4394, pp. 468-475. Springer, 2007.
  - T. Hofmann: Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning, 1-2(42), pp. 177-196. 2001.
  - Y. Gong, X. Liu: Generic text summarization using relevance measure and latent semantic analysis. Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 19-25. 2001.
- Searching the Web
  - M. Bianchini, M. Gori, F. Scarselli: Inside PageRank. ACM Transactions on Internet Technology, 1(5), pp. 92-128. 2005.
  - T.H. Haveliwala: Topic-sensitive PageRank: a context-sensitive ranking algorithm for Web search. IEEE Transactions on Knowledge and Data Engineering, 4(15), pp. 784-796. 2003.
  - S. Kamvar et al.: Extrapolation methods for accelerating PageRank computations. Proceedings of the 12th international conference on World Wide Web, pp. 261-270. 2003.
  - S. Kamvar et al.: Exploiting the Block Structure of the Web for Computing PageRank. Technical report, Stanford University, 2003.
  - A. Broder et al.: Efficient PageRank approximation via graph aggregation. Information Retrieval, 2(9), pp. 123-138. 2006.
Unassessed Exercises
Variance/Co-Variance calculations
- Un-tar the file
CovEx.tar and go to the folder
"CovEx".
The folder contains seven 2-dimensional datasets dataN.dat, N=1,2,...,7,
along with pictures of their distribution (dataN.jpg, N=1,2,...,7).
- For each dataset estimate the variances Var[X] and Var[Y] of the data
along the two coordinates x and y, as well as the co-variance Cov[X,Y].
- Report the calculated estimates along with an explanation of the differences among the estimated values for the 7 datasets (a minimal sketch of the estimators follows this list).
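A minimal sketch of these estimators in Python (assuming the dataN.dat files hold plain space-separated x and y columns; adjust the loading step if the format differs):

    import numpy as np

    # Load one dataset (space-separated x and y columns assumed).
    data = np.loadtxt("CovEx/data1.dat")
    x, y = data[:, 0], data[:, 1]

    # Unbiased sample estimates (normalised by n-1).
    var_x = np.var(x, ddof=1)
    var_y = np.var(y, ddof=1)
    cov_xy = np.cov(x, y)[0, 1]   # np.cov normalises by n-1 by default

    print(f"Var[X]={var_x:.3f}  Var[Y]={var_y:.3f}  Cov[X,Y]={cov_xy:.3f}")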
Principal Component Analysis
Assignment for those taking "Intelligent Data Analysis
Extended" (DEADLINE Tuesday of Week 1 of summer term, noon)
NOTE: This year, the requirement is to study the data using PCA only.
You can choose any data set(s) from the list below.
Un-tar the relevant file and go to the corresponding folder.
The folder contains the data set, as well as additional information
about the data. Read the available information, especially
description of the features (data dimensions).
You will need to clean the data so that it contains
only numerical features (dimensions)
and the features are space-separated (not comma-separated).
To make the plots informative, you should
come up with a labelling scheme for data points.
If the data can be classified into several classes
(find out in the data and feature description!), use that information
as the basis for your labelling scheme.
In that case exclude the class information
from the data dimensions.
Alternatively, you can make labels out of any dimension,
e.g. by quantising it into several intervals. For example,
if the data dimension represents age of a person,
you can quantise it into 5 labels (classes)
[child, teenager, young adult, middle age, old].
Associate the data labels with different markers and
use the markers to show what kind of data points get projected
to different regions of the visualization plot
(computer screen).
- Learn as much as you can about the assigned data set(s) using the visualization methods developed in the module, namely PCA and SOM.
- Use various data labelling schemes.
- Compare more complex visualization schemes with straightforward co-ordinate projection methods.
Before starting to work on the assignment,
please carefully study the example
I prepared using
the boston database.
Un-tar the file
boston.ex.tar.gz and go to the folder
"BOSTON.EX".
The subfolder "FIGURES" contains all the relevant figures
as eps or gif files.
Please consult the "boston.read.me" file in BOSTON.EX.
The report should describe experiments with a chosen data set(s)
along the lines of the 'boston example'.
In the labeling scheme, concentrate on more than one coordinate (dimension).
For example, in the 'boston example', consider not just the 'price' feature,
but also run separate experiments with 'per capita crime rate in the town'
or 'pupil-teacher ratio in the town' in place of the 'price' coordinate.
In the report concentrate on the following questions:
- How did you preprocess the data? (worth 20%)
- What features (coordinates) did you use for labeling the
projected points with different markers? (worth 10%)
- How did you design the labeling schemes? (worth 20%)
- What visualisation techniques did you use (e.g. PCA/coordinate projections/SOM)? (worth 10%)
- What interesting aspects of the data did you detect
based on the data visualisations? (worth 20%)
- What interesting aspects of the data did you detect
based on eigenvector and eigenvalue analysis of the data covariance matrix? (worth 20%)
You should demonstrate that you
- understand the visualisation techniques used
- are able to extract useful information about otherwise hard-to-grasp
high-dimensional data using
dimensionality-reducing visualisation techniques.
The overall mark m (in the range 0-100%) will be linearly scaled to the range 0-20% by
0.2*m.
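For orientation, here is a minimal sketch of the kind of PCA projection plot the report asks for. The file name mydata.dat, the choice of column 0 as the labelling feature, and the three-way quantisation are all hypothetical placeholders, not part of the assignment data:

    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical space-separated numerical data; column 0 is used for labelling.
    raw = np.loadtxt("mydata.dat")
    label_feat, X = raw[:, 0], raw[:, 1:]

    # Centre the data and eigen-decompose its covariance matrix.
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    W = eigvecs[:, order[:2]]          # two leading principal directions

    # Project onto the first two principal components.
    P = Xc @ W

    # Quantise the labelling feature into three classes; one marker per class.
    classes = np.digitize(label_feat, np.quantile(label_feat, [1/3, 2/3]))
    for c, marker in zip(range(3), ["o", "s", "^"]):
        mask = classes == c
        plt.scatter(P[mask, 0], P[mask, 1], marker=marker, label=f"class {c}")
    plt.xlabel("PC 1"); plt.ylabel("PC 2"); plt.legend()
    plt.show()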
Data Visualisation using PCA and SOM
Try a
Java applet for an interactive PCA/SOM created by Daoxiao Jin.
Source code (tar+gzip)
Preparing for the exam - Sample Questions
- Principal Component Analysis (PCA)
Handouts: Principal Component Analysis
- Why is it good to concentrate on the variance/covariance structure of
multidimensional data?
- How can the variance/covariance structure be quantified?
Define variance of a random variable and covariance
of two random variables.
- How can the variance of a random variable X be estimated from a finite
sample of realizations of X?
- Consider two random variables X,Y that form a two-dimensional
vector random variable Z=(X,Y). How can the covariance between
X and Y be estimated given a finite sample of realizations of Z?
- What is the covariance matrix of a vector random variable?
- Define eigenvectors and eigenvalues of a square matrix A.
- Why are the eigenvectors/eigenvalues of covariance matrix important?
- Write down the PCA algorithm.
- How can you quantify the amount of lost information
when projecting points onto a low-dimensional plane via PCA? (A small numerical sketch follows this list.)
- What are the advantages/drawbacks of PCA?
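As a study aid for the lost-information question, a small numerical sketch (the eigenvalues below are illustrative, not taken from any module dataset):

    import numpy as np

    # Eigenvalues of a data covariance matrix, sorted in decreasing order.
    eigvals = np.array([4.2, 1.1, 0.3, 0.1])

    # Projecting onto the top-k eigenvectors discards the remaining
    # eigenvalues; the lost variance is their share of the total.
    k = 2
    lost = eigvals[k:].sum() / eigvals.sum()
    print(f"fraction of variance lost with k={k}: {lost:.1%}")   # 7.0%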
- Data Clustering and Classification
Handouts: Topographic Maps of Vectorial Data
You need only slides 5-9.
- What constitutes a classification problem?
- Explain the notion of model complexity and overfitting (overtraining) using
the K-NN classifier.
- What is data clustering?
- Explain the K-means clustering algorithm (a minimal sketch follows this list).
- Explain the on-line clustering algorithm.
- What are the drawbacks of such algorithms?
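A minimal NumPy sketch of K-means for self-study; this is a generic version, not the module's reference implementation:

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        """Plain K-means: alternate assignment and centroid-update steps."""
        rng = np.random.default_rng(seed)
        centres = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Assignment step: each point joins its nearest centre.
            dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: each centre moves to the mean of its points.
            new_centres = np.array([X[labels == j].mean(axis=0)
                                    if np.any(labels == j) else centres[j]
                                    for j in range(k)])
            if np.allclose(new_centres, centres):
                break
            centres = new_centres
        return centres, labels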
- Mining Textual Data
Handouts: Mining Textual Data
- What are the specifics of textual data that make data mining challenging?
- What are the options for representing textual data items (e.g. text documents)?
- Do we need a detailed grammar model when mining textual data? Why?
- What are the advantages/disadvantages of using a vector space representation
of documents?
- Explain the main ideas behind Latent Semantic Analysis (LSA).
- Write down the LSA algorithm (a toy sketch follows this list).
- In what situations would you expect LSA to work well? Why?
- What are the drawbacks of LSA?
- Describe options for building a classifier for textual data.
- How can we uncover hidden structure among the terms (as opposed to documents)?
- Describe options for document retrieval using LSA.
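A toy sketch of the core LSA step, a truncated SVD of a term-document matrix (the matrix below is made up for illustration):

    import numpy as np

    # Toy term-document matrix: rows are terms, columns are documents.
    A = np.array([[2., 0., 1.],
                  [1., 1., 0.],
                  [0., 2., 1.],
                  [0., 1., 2.]])

    # LSA keeps only the k largest singular values, giving a low-rank
    # "semantic" approximation of the term-document matrix.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    # Documents can now be compared in the k-dimensional latent space.
    doc_coords = np.diag(s[:k]) @ Vt[:k, :]   # one column per document
    print(doc_coords)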
- Searching the Web
Handouts: PageRank
- Why do we need a good ranking algorithm in document retrieval?
- What are the main ideas behind the PageRank algorithm? (A small sketch follows this list.)
- How can standard PageRank algorithm be fooled (web spamming)?
- Why do we need to use an iterative approach to solve PageRank equations?
- Why do iterations of PageRank equations converge?
- Define the notion of energy of a web community.
- For a given web community, how can its energy be decomposed with respect to
internal and external links to/from/within the community?
- What are dangling pages?
- Why should dangling pages be avoided?
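A small power-iteration sketch of the PageRank equations, with the usual uniform redistribution for dangling pages (the damping factor 0.85 and the three-page graph are illustrative):

    import numpy as np

    def pagerank(links, d=0.85, tol=1e-10):
        """links[i] lists the pages that page i points to."""
        n = len(links)
        r = np.full(n, 1.0 / n)
        while True:
            r_new = np.full(n, (1.0 - d) / n)
            for i, outs in enumerate(links):
                if outs:                       # spread rank along out-links
                    for j in outs:
                        r_new[j] += d * r[i] / len(outs)
                else:                          # dangling page: spread uniformly
                    r_new += d * r[i] / n
            if np.abs(r_new - r).sum() < tol:
                return r_new
            r = r_new

    # Tiny example: page 0 -> 1, 2;  page 1 -> 2;  page 2 -> 0.
    print(pagerank([[1, 2], [2], [0]]))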
Problems to solve for those really interested
- Principal Component Analysis
- Show how the need for eigen-decomposition of the data covariance
matrix arises from minimizing the distance between the data
points and their corresponding projections in a low-dimensional
linear subspace.
- What happens if there are two (or more) equal eigenvalues
of the covariance matrix?
- In general, the eigenvalues can be complex numbers. Why are the eigenvalues
of the covariance matrix always real numbers?
- Noah Gresham's talk
Try the models out! - Benchmark data sets
As you get familiar with different types of models,
try them out on benchmark data sets that people in the
machine learning community have been using to support
their claims about yet another excellent learning system :-)
Here are two of the widely used data repositories that contain
data descriptions, the data itself and other useful things,
like previously obtained results.
DELVE -
Data for Evaluating Learning in Valid Experiments
UCI Knowledge Discovery in Databases Archive
Also, try these test collections
that are used to measure the effectiveness of information retrieval
and text mining systems.
Again, the repositories contain
data descriptions, the data itself and other useful things,
like previously obtained results.
Reuters-21578. The most widely used text categorization test collection.
RCV1 (Reuters Corpus Volume 1).
Recently released collection of news stories.
TREC-AP.
A text categorization task based on the Associated Press articles.
20 Newsgroups.
20000 messages taken from 20 Usenet newsgroups.
Aims, Objectives, and Assessment
For formal details about the aims, objectives, and assessment, you
should look at the official Module Syllabus Page
[IDA]
[IDA
Extended].
IDA students will be assessed by examination (100%).
IDA Extended students will be assessed based on examination (80%)
and a continuous assessment by mini-project report (20%).
Recommended Books
The Recommended Books for this module are:
| Title | Author(s) | Publisher, Date | Comments |
| The Elements of Statistical Learning: Data Mining, Inference, and Prediction | T. Hastie, R. Tibshirani, J. Friedman | Springer, 2001 | Comprehensive; covers many state-of-the-art statistical learning techniques and is very helpful for understanding the essence of Data Mining. Highly recommended for mathematically minded students. |
| Principles of Data Mining | D.J. Hand, H. Mannila, P. Smyth | MIT Press, 2003 | A nice, gentle introduction to many areas of Data Mining. |
| Data Mining: Introductory and Advanced Topics | M.H. Dunham | Prentice Hall, 2003 | A less mathematical, brief introduction to basic Data Mining techniques. |
| Neural Networks for Pattern Recognition | Christopher Bishop | Clarendon Press, Oxford, 1995 | You may need some sections of this book, particularly those on linear techniques (such as PCA) and generalisation. |
This page is maintained by
Peter Tino.
Last updated on 13 Feb 2009.