
Machine Learning – 2011, Autumn term 


Induction slides

Lecture handouts, tutorial worksheets and practicals

1. Basic Notions of Learning, Introduction to Learning Algorithms, Literature

 

2. Probabilistic Models of Sequences.   Worksheet   (model answers included in the RL1 model answers)

You can read more about the Web link prediction example here.

 

3. Reinforcement Learning I.   Worksheet1 Solutions

 

4. Reinforcement Learning II.   Worksheet2 Solutions


5. Probabilistic Latent Semantic Analysis  [check related Practical Assignment 3 below - code provided!]

 

6. Instance-Based Learning (k-nearest neighbour, Case-based Reasoning)  Worksheet5      Solutions
 

7. Decision Tree Learning    Worksheet4  Solutions          [check related Practical Assignment below - no programming]

8. Intro to Bayesian Learning: Bayesian classification   Worksheet3 Solutions

 

9. Independent Component Analysis. The cocktail party problem (demo). [check related Practical Assignment 4 below – code provided!]

 

10. Clustering and Visual Data Analysis   K-means MatLab code: K-means, calling script, distance function, another function needed

 

11. Support vector machines Worksheet Solutions

 

12. Genetic Algorithms

Continuous assessment (ML: 20% of the overall mark; ML-Extended: 40% of the overall mark) CA marks

Types of assignments:

1) There are exercises set on the Worksheets (see above). The deadline for handing in solutions to any of these is before the next class. We usually solve these exercises in the tutorial classes, hence the tight deadline.

Note that Worksheet-exercise type of questions may appear in the exam, so it is highly recommended that you put in some effort in trying to solve them yourself.

Easiest way to hand in: directly to me before the class.

2) There are also practical exercises set for you (see below), which may involve some programming work and the use of machine-learning techniques. For these, the deadline is the end of term.
These assignments allow you to ‘get your hands dirty’ and gain experience with how these methods work. Please give them a try before complaining about the module being too ‘theoretical’.

 

Choices and restrictions on continuous assessment

ML: Most exercises are worth 5%, so to gather 20% you will need to complete 4 pieces of work.

ML-EXTENDED: to get 40%, you will need to complete 8 pieces of work. Of these, 20% (equivalent to 4 pieces) MUST be chosen from those marked with “EXT”.

ALL: You can choose which pieces of work you want to put forward for marking. To make the most of this module you should try each exercise. I will mark all the work that you submit and will take the best 4 (8 for ‘extended’) marks from those.

 

Feedback

You get feedback straight away on your efforts of solving the Worksheet exercises, as we solve them in the class.

You will also get feedback by getting your marked work returned to you within 2 weeks. If you fail to collect your marked work in class, you should come to my office (during my office hour) to collect it.

In addition, ask me questions any time during lectures, tutorials and office hours about anything you found unclear. I am happy to help those of you who are interested in learning. But please prepare concrete questions. Don’t expect me to repeat a whole lecture or to solve your exercises!

 

Important note
For homework problems or programming assignments you are allowed to discuss the problems or assignments verbally with other class members, but under no circumstances can you look at or copy anyone else's written solutions or code relating to homework problems or programming assignments. All problem solutions submitted must be material you have personally written during this term. Failure to adhere to this policy can result in a student receiving a failing grade in the class.

 

Handing in procedure
Submit your work on paper. You can hand in any time during the term, until the last day of the term. Don’t forget to put your student ID on it. 

Practical Assignment 1 [EXT]. [a) 5% b) 5% c) 5%] Prediction using probabilistic sequence models.

 Download the data (in MatLab .mat format) that you can use for this exercise. The same data is also provided in text format (file1 file2) in case you want to use another programming language of your choice. Alternatively you can use any symbolic sequence data set you like. If you go for the last option, please consult me first.

Practical Assignment 2 [EXT]. [5%] Write a program that implements Q-learning in non-deterministic environments to find the optimal action plan for the situation described on Worksheet2 above. You may use your favorite programming language. Write up about 2 pages on the data structures you used, the way you chose the actions during learning, and the results obtained by running your program. In particular, the following are of interest:
- Plot the evolution of the Q(s,a) values (Q-values) against iterations, in any form you find suggestive, so as to show the convergence of the algorithm.
- On a separate figure, plot the evolution of the cumulative reward against iterations.
- Give the Q-table obtained after convergence.
- Comment on all these plots, i.e. explain in words what you see from these figures.
Hand in your write-up, not your code.
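
As a rough illustration of the kind of update involved, here is a minimal Python sketch of tabular Q-learning with a visit-count-based learning rate, which is one common choice for non-deterministic environments. Everything about the environment is an assumption made for this sketch (a 5-state corridor with a "slip" probability, the reward of 1 at the right-hand end, the constants GAMMA and EPSILON); it is not the Worksheet2 problem, and other action-selection schemes than epsilon-greedy are equally valid.

```python
import random

N_STATES = 5          # hypothetical corridor: states 0..4, state 4 is the goal
ACTIONS = (-1, +1)    # step left / step right
SLIP = 0.2            # assumed probability that a move goes the opposite way
GAMMA = 0.9           # discount factor (assumed)
EPSILON = 0.1         # exploration rate for epsilon-greedy action choice (assumed)


def step(state, action, rng):
    """Non-deterministic transition: with probability SLIP the move is reversed."""
    move = -action if rng.random() < SLIP else action
    nxt = min(max(state + move, 0), N_STATES - 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward


def q_learning(episodes=2000, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    visits = {key: 0 for key in q}
    rewards_per_episode = []
    for _ in range(episodes):
        state, total = 0, 0.0
        for _ in range(100):  # cap on episode length
            # epsilon-greedy: mostly exploit, sometimes explore
            if rng.random() < EPSILON:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            nxt, reward = step(state, action, rng)
            done = nxt == N_STATES - 1
            # learning rate decays with the number of visits to (state, action)
            visits[(state, action)] += 1
            alpha = 1.0 / (1 + visits[(state, action)])
            best_next = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            target = reward + GAMMA * best_next
            q[(state, action)] = (1 - alpha) * q[(state, action)] + alpha * target
            total += reward
            state = nxt
            if done:
                break
        rewards_per_episode.append(total)
    return q, rewards_per_episode


q_table, rewards = q_learning()
```

The returned `rewards` list can be summed cumulatively to produce the reward plot, and snapshots of `q_table` taken inside the loop would give the Q-value curves.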

Practical Assignment 3 – Probabilistic Latent Semantic Analysis [5%] I prepared a term by document matrix of a subset of the 20Newsgroups text collection for you together with the associated dictionary of terms: data file (100 terms x 348 docs); dictionary file (the 100 terms listed). Use my MatLab implementation of PLSA (or alternatively implement your own in your favorite programming language) to seek 4 topics in this data. Use the relevant parameters returned by the algorithm to list the 10 most probable words that characterize these topics. Try also to search for 5 or more topics. Write up your findings (1-2 pages). What topics can you identify in this document collection? How is the presence of these topics distributed across the documents?

You may find the following useful, if you choose to solve this exercise in MatLab:

- Have a look at the parameter arrays involved. An array (matrix) M can be plotted e.g. like this:

>> imagesc(1-M);colormap gray

- For loading in the dictionary file '4news_dictionary.txt' into the MatLab workspace, use the following:

>> [ignore, terms]=textread('4news_dictionary.txt','%d %s',-1);

- When writing your code to list the most probable words in each topic, be aware that MatLab has a built-in function 'sort' that you can use instead of coding the sorting from scratch. Type 'help sort' to find out more about this function.
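
If you work in another language, the same idea (sort the term indices of each topic by probability and keep the first few) might look like the Python sketch below. The function name `top_words`, the variable `p_w_given_z` and the toy numbers are all made up for illustration; they are not the output of PLSA on the provided data.

```python
def top_words(p_w_given_z, terms, k=10):
    """For each topic, return the k most probable (term, probability) pairs.

    p_w_given_z is assumed to hold one row per topic and one column per term.
    """
    result = []
    for topic_probs in p_w_given_z:
        # indices of the terms, ordered from most to least probable
        order = sorted(range(len(terms)), key=lambda i: topic_probs[i], reverse=True)
        result.append([(terms[i], topic_probs[i]) for i in order[:k]])
    return result


# Made-up toy numbers: 2 topics over 4 terms
terms = ["space", "nasa", "god", "church"]
p_w_given_z = [
    [0.50, 0.30, 0.15, 0.05],
    [0.05, 0.10, 0.45, 0.40],
]
top = top_words(p_w_given_z, terms, k=2)
```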

For your convenience, I have added a small demo MatLab script which calls PLSA and plots the variables involved. Here is a pretty figure obtained by running this demo that illustrates the workings of the algorithm. It shows how the initial terms x docs matrix is decomposed into the product of a terms x topics matrix and a topics x docs matrix.
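
In symbols, the decomposition can be written as below. The notation is an assumption made here (w for a term, d for a document, z for one of K topics); the provided implementation may name or transpose its arrays differently.

```latex
P(w \mid d) \;=\; \sum_{z=1}^{K} P(w \mid z)\, P(z \mid d),
\qquad
\sum_{w} P(w \mid z) = 1, \quad \sum_{z=1}^{K} P(z \mid d) = 1 .
```

For the data above with K = 4, the array of P(w|z) values has 100 x 4 entries and the array of P(z|d) values has 4 x 348 entries, so their product has the same 100 x 348 shape as the terms x docs matrix.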

Practical Assignment 4 – ICA.

a) [5%] Download the FastICA toolbox. Generate 4 signals using the ‘demosig’ function included in the toolbox. Retain 2 of those signals only. Generate a random linear mixture. In MatLab, this is accomplished with the commands below.

>> s=demosig; ind=[index1, index2]; s=s(ind,:);

>> A=rand(2); x=A*s;

Then use the software to try to recover the two signals s. Repeat the experiment a few times, using a different pair of signals (out of the 4) and each time testing different nonlinearities g (out of those pre-set in the software). Record and report at least one case [i.e. the indexes of the 2 initial signals used and the g used] where the signal separation was consistently successful and one where it wasn’t.
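
To make the mixing step concrete outside MatLab, here is a small Python sketch of x = A*s. The two source signals (a sine and a sawtooth) are stand-ins chosen for this illustration, not the output of 'demosig', and the helper names are hypothetical. The last lines are only a sanity check: when A is known, un-mixing is a plain matrix inverse, whereas ICA has to estimate the un-mixing from x alone.

```python
import math
import random


def make_sources(n=500):
    """Two stand-in source signals (NOT the toolbox's demosig output)."""
    sine = [math.sin(2 * math.pi * t / 50) for t in range(n)]
    saw = [2 * ((t % 37) / 37) - 1 for t in range(n)]
    return [sine, saw]  # 2 x n, one signal per row


def mix(a, s):
    """Compute x = A * s for a 2x2 matrix a and a 2 x n signal array s."""
    n = len(s[0])
    return [[a[i][0] * s[0][t] + a[i][1] * s[1][t] for t in range(n)]
            for i in range(2)]


def inverse_2x2(a):
    """Inverse of a 2x2 matrix (assumes a non-zero determinant)."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


rng = random.Random(1)
s = make_sources()
A = [[rng.random() for _ in range(2)] for _ in range(2)]  # plays the role of rand(2)
x = mix(A, s)

# Sanity check only: with A known, the sources come back via the inverse.
s_back = mix(inverse_2x2(A), x)
max_error = max(abs(s_back[i][t] - s[i][t])
                for i in range(2) for t in range(len(s[0])))
```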

b) [EXT] [5%] For the previous question, explain *why* the separation succeeded in one case and not in the other. (To answer this, you will need to read around the subject, starting from the tutorial and links given on the last page of the handout, and become familiar with the basic statistical issues involved.)

 

 

Introduction to MatLab


Many Machine Learning methods and algorithms are already implemented and available for download. Much of this software, including newly designed methods, is written in MatLab. MATLAB® is a high-performance language for technical computing, and it is easy to use. The name MATLAB stands for matrix laboratory, because the basic data type in MatLab is the matrix: a scalar number is just a 1 x 1 matrix! You can perform operations on matrices in a single line of code.

To use MatLab, log in to your Unix account and start MatLab by typing

 

matlab -nojvm

You will get a prompt like this:

>> 

To get general help, type

>> help

To get help on a <command>, type

>> help <command>

To quit, type

>> quit

 
Here is a simple MatLab tutorial. It contains all you need for this module.
http://www.cyclismo.org/tutorial/matlab/
Here is another one -- in case you are still not convinced that MatLab is easy.
http://users.rowan.edu/~shreek/networks1/matlabintro.html

Here is the complete online MatLab help.
http://www.mathworks.com/access/helpdesk/help/techdoc/learn_matlab/learn_matlab.html

Here is the MatLab manual. It is a 184-page book; see especially chapters 2 and 4.
http://www.mathworks.com/access/helpdesk/help/pdf_doc/matlab/getstart.pdf

 

By the end of the practical work in this module you will be able to:
- use Machine Learning methods available as MatLab programs
- apply them to real data analysis problems
- look for help on a program or on a method