Potential Projects in 2016-17 for Final Year and MSc Students

Xin Yao, Office: Rm 211. Phone: x43747. Email: x.yao@cs.bham.ac.uk


General Description

My major research interests include natural computation, machine learning, computational intelligence, meta-heuristic optimisation, and real-world applications. I have collaborated with many students over the years on various projects in these areas.

In most cases, the approaches and techniques we used come from the fields of natural computation and computational intelligence, e.g., evolutionary algorithms, particle swarm optimisation, ant colony optimisation, differential evolution, neural networks, etc. All our projects involved both algorithm development, where independent critical thinking is essential, and software implementation, where programming and software development skills are critical. I am keen on projects that tackle real-world problems, or at least have a strong real-world application background, using natural computation and computational intelligence approaches. As for the specific topic, I am very happy to discuss it with potential students and agree on one. Some illustrative examples are:

Specific Project Ideas

(1) Mining Data Streams Using Advanced Online Ensemble Learning

Online learning has been shown to be very useful for a growing number of applications in which training data are available continuously in time (streams of data) and/or there are time and space constraints. Examples of such applications are industrial process control, computer security, intelligent user interfaces, market-basket analysis, information filtering and prediction of conditional branch outcomes in microprocessors.

Ensembles of classifiers have been successfully used to improve the accuracy of single classifiers in online learning. However, online environments are often non-stationary and the variables to be predicted by the learning machine may change with time (concept drift). For example, in an information filtering system, the users may change their subjects of interest with time. So, learning machines used to model these environments should be able to adapt quickly and accurately to possible changes.

We take the term concept to refer to the whole distribution of the problem at a certain point in time, characterized by the joint distribution p(x,w), where x represents the input attributes and w represents the classes. A concept drift is therefore a change in the distribution of the problem.
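As a minimal illustration of this definition, the sketch below generates a toy stream in which the relationship between x and w changes at a chosen time step, and measures how a classifier that never adapts behaves before and after the change. The stream, the function names (make_stream, fixed_rule) and the abrupt swap of classes are all assumptions made for illustration; real drifts can also be gradual or affect p(x) only.

```python
import random

def make_stream(n_steps, drift_at, seed=0):
    """Yield (x, w) pairs; the rule linking x to w flips at step drift_at."""
    rng = random.Random(seed)
    for t in range(n_steps):
        x = rng.random()                    # single input attribute in [0, 1)
        if t < drift_at:
            w = 1 if x > 0.5 else 0         # old concept
        else:
            w = 1 if x <= 0.5 else 0        # new concept: the classes are swapped
        yield x, w

def fixed_rule(x):
    """A classifier that matches the old concept and never adapts."""
    return 1 if x > 0.5 else 0

stream = list(make_stream(n_steps=200, drift_at=100))
acc_before = sum(fixed_rule(x) == w for x, w in stream[:100]) / 100
acc_after = sum(fixed_rule(x) == w for x, w in stream[100:]) / 100
print(acc_before, acc_after)
```

A learner that cannot adapt is perfect on the old concept and useless on the new one, which is why drift handling matters.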

Recently, a new online ensemble learning algorithm, called DDD, has been developed [1]. It has been shown to perform very well on a variety of different data streams. However, there are improvements that can be made to enhance the algorithm. This project investigates one such potential improvement during the training of individual base learners using bagging, and develops an easy-to-use tool for it. The project includes both creative thinking in terms of ideas and software engineering of a tool. The potential improvements will be provided to the project student.
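For students unfamiliar with bagging in the online setting, the following is a minimal sketch of the commonly used Poisson-based online bagging scheme: each incoming example is presented to each base learner k times, with k drawn from a Poisson(lam) distribution. The toy CentroidLearner, the class and parameter names, and the idea of exposing lam as a tunable knob are assumptions for illustration only; the exact training procedure used in DDD should be taken from [1] and the provided code.

```python
import math
import random

def poisson(lam, rng):
    """Sample from Poisson(lam) using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

class CentroidLearner:
    """Toy online base learner: nearest class centroid on a 1-D input."""
    def __init__(self):
        self.sums = {}
        self.counts = {}

    def learn(self, x, w):
        self.sums[w] = self.sums.get(w, 0.0) + x
        self.counts[w] = self.counts.get(w, 0) + 1

    def predict(self, x):
        if not self.counts:
            return None                      # nothing learned yet
        return min(self.counts,
                   key=lambda w: abs(x - self.sums[w] / self.counts[w]))

class OnlineBagging:
    """Each example is shown to each learner Poisson(lam) times."""
    def __init__(self, n_learners, lam=1.0, seed=0):
        self.rng = random.Random(seed)
        self.lam = lam
        self.learners = [CentroidLearner() for _ in range(n_learners)]

    def learn(self, x, w):
        for learner in self.learners:
            for _ in range(poisson(self.lam, self.rng)):
                learner.learn(x, w)

    def predict(self, x):
        votes = [v for v in (l.predict(x) for l in self.learners) if v is not None]
        return max(set(votes), key=votes.count) if votes else None
```

With lam = 1 this approximates ordinary bagging on a stream; other values of lam change how differently the base learners are trained, which is one place where modifications could be tried.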

Reference:

[1] L. L. Minku and X. Yao, "DDD: A New Ensemble Approach For Dealing With Concept Drift," IEEE Transactions on Knowledge and Data Engineering, 24(4):619-633, April 2012.

Notes:

(i) This project is suitable for students who have taken at least one of the following modules: Machine Learning, Intelligent Data Analysis, or Neural Computation.

(ii) Programming skills are essential, although the current implementation of the algorithm is available. The student needs to understand the algorithm and the code so that his/her own modifications can be made.

(2) Small data challenge --- how to use other people's data when you don't have your own --- a case study in software effort estimation

There has been a long debate in the software engineering literature concerning how useful cross-company (CC) data are for software effort estimation (SEE) in comparison to within-company (WC) data. Studies indicate that models trained on CC data obtain either similar or worse performance than models trained solely on WC data. We aim to investigate whether CC data can help to increase performance, and under what conditions.

The work concentrates on the fact that SEE is a class of online learning tasks which operate in changing environments, even though most work so far has neglected that. We have recently developed a new online ensemble learning approach, called DCL, which is able to identify when CC data are helpful and how to make best use of them [1]. When data from multiple other companies (i.e., CC data) are available, we learn how useful each company's data are by adapting the weight assigned to a model trained on that company's data. This project will investigate different techniques that can improve upon the existing work [1] in adapting the weights. The project includes both creative thinking in terms of ideas and software engineering of a tool.
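To make the idea of adapting weights concrete, here is a minimal sketch of one possible multiplicative scheme: when a new within-company example with a known effort arrives, the model with the smallest absolute error keeps its weight, every other model's weight is multiplied by a factor beta, and the weights are renormalised. The function names, the parameter beta and the winner-takes-all rule are assumptions for illustration; the rule actually used by DCL should be checked against [1].

```python
def update_weights(weights, predictions, actual, beta=0.5):
    """Shrink the weight of every non-winning model by beta, then renormalise."""
    errors = [abs(p - actual) for p in predictions]
    winner = errors.index(min(errors))          # model closest to the true effort
    shrunk = [w if i == winner else w * beta for i, w in enumerate(weights)]
    total = sum(shrunk)
    return [w / total for w in shrunk]

def combine(weights, predictions):
    """Ensemble estimate: weighted average of the individual estimates."""
    return sum(w * p for w, p in zip(weights, predictions))

# Three models (e.g. one per company) estimating effort for the same project.
weights = [1 / 3, 1 / 3, 1 / 3]
predictions = [100.0, 200.0, 400.0]
weights = update_weights(weights, predictions, actual=210.0)
estimate = combine(weights, predictions)
```

Alternatives worth exploring include updates that depend on the size of the error rather than only on which model wins.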

Reference:

[1] L. L. Minku and X. Yao, "Can Cross-company Data Improve Performance in Software Effort Estimation?" Proc. of the 2012 Conference on Predictive Models in Software Engineering (PROMISE'12), Lund, Sweden, 21-22 September 2012. DOI: 10.1145/2365324.2365334.

Notes:

(i) This project is suitable for students who have taken at least one of the following modules: Machine Learning, Intelligent Data Analysis, or Neural Computation.

(ii) Programming skills are essential, although the current implementation of the algorithm is available. The student needs to understand the algorithm and the code so that his/her own modifications can be made.

(3) How to Make Best Use of Other People's Data?

This project shares the same background as Project (2), but focuses on a different issue.

There has been a long debate in the software engineering literature concerning how useful cross-company (CC) data are for software effort estimation (SEE) in comparison to within-company (WC) data. Studies indicate that models trained on CC data obtain either similar or worse performance than models trained solely on WC data. We aim to investigate how CC data can best be used by a company.

We have recently developed a new online ensemble learning approach, called Dycom, which maps CC data into a company's own context so that other companies' data can be used by that company [1]. The key to the success of this approach is how well this mapping is learned and how well it captures the essence of the relationship between the two companies' data. In its current implementation [1], the mapping is assumed to be linear for simplicity. This is not a realistic assumption, because the relationship between two companies' data is often nonlinear. This project will use the Dycom framework, but introduce nonlinear mappings into the algorithm. A tool for software effort estimation will be developed as part of the project. The project includes both creative thinking in terms of ideas and software engineering of a tool.
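As a rough illustration of what a linear mapping can look like, the sketch below assumes the simplest case: the within-company estimate is a cross-company model's estimate multiplied by a single factor b, and b is nudged towards the observed ratio each time a within-company example with a known effort arrives. The names update_factor and lr, and the update rule itself, are assumptions for illustration; the actual Dycom formulation is given in [1].

```python
def update_factor(b, cc_estimate, wc_actual, lr=0.1):
    """Move the scaling factor b towards the ratio seen on one WC example."""
    return lr * (wc_actual / cc_estimate) + (1.0 - lr) * b

def map_to_wc(b, cc_estimate):
    """Linear mapping of a CC model's estimate into the WC context."""
    return b * cc_estimate

# Suppose the CC model keeps estimating 200 where the company actually needs 100.
b = 1.0
for _ in range(200):
    b = update_factor(b, cc_estimate=200.0, wc_actual=100.0)
mapped = map_to_wc(b, 200.0)
```

A nonlinear variant would replace the single factor b with a function of the project's attributes, which is the direction this project explores.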

Reference:

[1] L. L. Minku and X. Yao, "How to Make Best Use of Cross-company Data in Software Effort Estimation?" Proc. of the 36th International Conference on Software Engineering (ICSE'14), Hyderabad, India, 31/5-7/6/2014, pp.446-456, IEEE Press.

Notes:

(i) This project is suitable for students who have taken at least one of the following modules: Machine Learning, Intelligent Data Analysis, or Neural Computation.

(ii) Programming skills are essential, although the current implementation of the algorithm is available. The student needs to understand the algorithm and the code so that his/her own modifications can be made.

(4) A multi-population scheme for dynamic evolutionary optimisation

In many real-world applications, we have to deal with dynamic optimization problems (DOPs) [1]. In DOPs, the environment, including the objective function, the decision variables, the problem instance, constraints and so on, may vary over time. When the changes take place, it may take some time for an evolutionary algorithm (EA) to adapt to the new environment. Due to this characteristic of DOPs, an EA designed for stationary optimization problems, in which the environment will not change at all, may no longer be efficient.

This project develops a new EA for dynamic evolutionary optimisation based on the idea of DDD [2], an online ensemble learning algorithm in which two ensembles with different diversity levels are maintained during learning. The project will investigate the use of multiple populations with different diversity levels in dynamic optimisation. One of the key issues is deciding which diversity level to set for a given population, and when. Although this project does involve software development, the main effort will be on developing and evaluating creative ideas.
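The following is a minimal sketch of why populations with different diversity levels can help. It runs two small populations on a 1-D problem whose optimum jumps partway through the run; diversity is controlled here simply by the mutation step size. The problem, the function names, the step sizes and the use of mutation strength as the diversity control are all assumptions for illustration, not the algorithm to be developed.

```python
import random

def fitness(x, optimum):
    """Higher is better; the best possible value is 0 at the optimum."""
    return -abs(x - optimum)

def evolve(pop, optimum, sigma, rng):
    """One generation: Gaussian mutation, then keep the best of parents + children."""
    children = [x + rng.gauss(0.0, sigma) for x in pop]
    merged = sorted(pop + children, key=lambda x: fitness(x, optimum), reverse=True)
    return merged[:len(pop)]

def run(generations=60, change_at=30, seed=1):
    rng = random.Random(seed)
    low_div = [rng.uniform(-1.0, 1.0) for _ in range(10)]    # small mutation steps
    high_div = [rng.uniform(-1.0, 1.0) for _ in range(10)]   # large mutation steps
    history = []
    for g in range(generations):
        optimum = 0.0 if g < change_at else 20.0             # environment change
        low_div = evolve(low_div, optimum, 0.05, rng)
        high_div = evolve(high_div, optimum, 2.0, rng)
        # Populations are sorted best-first, so index 0 holds the current best.
        history.append((fitness(low_div[0], optimum), fitness(high_div[0], optimum)))
    return history

history = run()
low_after, high_after = history[35]   # a few generations after the change
```

Shortly after the change the high-diversity population is much closer to the new optimum, while small steps are better suited to fine-tuning in a stable environment; deciding how to combine the two is the research question.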

Reference:

[1] T. T. Nguyen, S. Yang, J. Branke and X. Yao, "Evolutionary Dynamic Optimization: Methodologies," Chapter 2, in Evolutionary Computation for Dynamic Optimization Problems (eds. S. Yang and X. Yao), Studies in Computational Intelligence, Volume 490, 2013, pp 39-64.

[2] L. L. Minku and X. Yao, "DDD: A New Ensemble Approach For Dealing With Concept Drift," IEEE Transactions on Knowledge and Data Engineering, 24(4):619-633, April 2012.

Notes:

(i) This project is suitable for students who have taken the Nature Inspired Optimisation module.

(ii) Programming skills are important.

(5) A multi-model memory for online ensemble learning

This project shares the same background as Project (1), but focuses on a different issue.

Online learning has been shown to be very useful for a growing number of applications in which training data are available continuously in time (streams of data) and/or there are time and space constraints. Examples of such applications are industrial process control, computer security, intelligent user interfaces, market-basket analysis, information filtering and prediction of conditional branch outcomes in microprocessors.

Ensembles of classifiers have been successfully used to improve the accuracy of single classifiers in online learning. However, online environments are often non-stationary and the variables to be predicted by the learning machine may change with time (concept drift). For example, in an information filtering system, the users may change their subjects of interest with time. So, learning machines used to model these environments should be able to adapt quickly and accurately to possible changes.

We take the term concept to refer to the whole distribution of the problem at a certain point in time, characterized by the joint distribution p(x,w), where x represents the input attributes and w represents the classes. A concept drift is therefore a change in the distribution of the problem.

Recently, a new online ensemble learning algorithm, called DDD, has been developed [1]. It has been shown to perform very well on a variety of different data streams. This project explores the possibility of keeping a memory of learned models that may be successfully revived at a later stage if the concept drifts back to an earlier state. This includes the questions of which models to memorise and how to make best use of the memorised models.
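One simple way to picture such a memory is sketched below: snapshots of past models are kept in a bounded store, and after a drift the memorised models are scored on a window of recent examples so that the best one can be revived if it beats the current model. The class ModelMemory, its capacity, the oldest-first replacement policy and the accuracy-on-recent-window criterion are all assumptions for illustration; choosing better policies is exactly what the project is about.

```python
from collections import deque

class ModelMemory:
    """Bounded store of past models; models are callables mapping x to a class."""
    def __init__(self, capacity=5):
        self.models = deque(maxlen=capacity)   # the oldest snapshot is dropped first

    def memorise(self, model):
        self.models.append(model)

    def best_for(self, recent_examples, current_model):
        """Return whichever model is most accurate on the recent examples."""
        def accuracy(model):
            hits = sum(model(x) == w for x, w in recent_examples)
            return hits / len(recent_examples)
        candidates = list(self.models) + [current_model]
        return max(candidates, key=accuracy)

def old_concept(x):
    return 1 if x > 0.5 else 0

def new_concept(x):
    return 1 if x <= 0.5 else 0

memory = ModelMemory(capacity=3)
memory.memorise(old_concept)                     # stored before the first drift

# The stream has drifted back: recent examples follow the old concept again.
recent = [(x / 10.0, old_concept(x / 10.0)) for x in range(10)]
revived = memory.best_for(recent, current_model=new_concept)
```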

Reference:

[1] L. L. Minku and X. Yao, "DDD: A New Ensemble Approach For Dealing With Concept Drift," IEEE Transactions on Knowledge and Data Engineering, 24(4):619-633, April 2012.

Notes:

(i) This project is suitable for students who have taken at least one of the following modules: Machine Learning, Intelligent Data Analysis, or Neural Computation.

(ii) Programming skills are essential, although the current implementation of the algorithm is available. The student needs to understand the algorithm and the code so that his/her own modifications can be made.

(6) Efficient Global Optimisation for dynamically changing environments

Efficient Global Optimisation (EGO) is a recently proposed optimisation algorithm particularly suited for problems where evaluating a solution is very expensive or time-consuming, so that only a very small number of solutions can be examined during optimisation. Morales-Enciso and Branke [1] recently suggested different ways to modify EGO so that it is able to adapt to a dynamically changing environment. One of the ideas was simply to reduce the influence of old data points by making them artificially noisy. However, this introduces a new parameter, the noise level, which has to be specified by the user. In this project, ways to learn an appropriate noise level from the data will be explored.
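The effect of making old data points artificially noisy can be seen in a small Gaussian process example. The sketch below computes the posterior mean of a zero-mean GP in which each sample's noise variance grows linearly with its age. The RBF kernel, the linear age-to-noise rule and the names gp_mean, noise_per_step and base_noise are assumptions for illustration; the formulation studied in [1] may differ.

```python
import math

def rbf(a, b, length=1.0):
    """Squared-exponential kernel with unit signal variance."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def gp_mean(x_query, xs, ys, ages, noise_per_step, base_noise=1e-6):
    """GP posterior mean; older samples get a larger noise variance."""
    n = len(xs)
    K = [[rbf(xs[i], xs[j]) for j in range(n)] for i in range(n)]
    for i in range(n):
        K[i][i] += base_noise + noise_per_step * ages[i]
    alpha = solve(K, ys)
    return sum(rbf(x_query, xs[i]) * alpha[i] for i in range(n))

# The same solution was evaluated twice: long ago (value 1) and just now (value 3).
xs, ys, ages = [0.0, 0.0], [1.0, 3.0], [10, 0]
no_ageing = gp_mean(0.0, xs, ys, ages, noise_per_step=0.0)
with_ageing = gp_mean(0.0, xs, ys, ages, noise_per_step=0.5)
```

Without ageing the model averages the two conflicting observations; with ageing the prediction follows the recent one. Here noise_per_step plays the role of the user-specified noise level that the project aims to learn from data.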

Reference:

[1] S. Morales-Enciso and J. Branke, "Tracking Global Optima in Dynamic Environments with Efficient Global Optimization," European Journal of Operational Research, 242:744-755, 2015.

Notes:

(i) This project is suitable for students who have taken the Nature Inspired Optimisation module.

(ii) Programming skills are important.

(7) Evolutionary Art

Evolutionary art refers to the innovative use of evolutionary computation ideas, e.g., evolutionary algorithms, in creating art through interactive evolution. The art form could include pictures, paintings, 3-d objects, music, etc. Evolutionary art has many practical applications. It could be used in designing fabric patterns [1] or other artefacts [2]. The primary aim of this project is to develop an interactive evolutionary art system as an app for an iPhone or Android phone. For example, we could have apps for evolving trees, evolving flowers, etc.
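The core loop of interactive evolution is simple, as the following minimal sketch shows: a genome of numeric parameters is mutated into several variants, the user picks a favourite, and the favourite becomes the parent of the next round. The three-number genome, the function names and the choose callback standing in for a tap on the screen are assumptions for illustration; a real app would render each genome as a picture.

```python
import random

def mutate(genome, rng, sigma=0.1):
    """Return a slightly perturbed copy of the genome."""
    return [g + rng.gauss(0.0, sigma) for g in genome]

def interactive_evolution(choose, rounds=20, litter=6, seed=0):
    """choose(candidates) returns the index of the user's favourite."""
    rng = random.Random(seed)
    # Hypothetical genome, e.g. branch angle, length ratio and colour of a tree.
    parent = [rng.random() for _ in range(3)]
    for _ in range(rounds):
        # The current parent stays on offer so the user can keep it unchanged.
        candidates = [parent] + [mutate(parent, rng) for _ in range(litter)]
        parent = candidates[choose(candidates)]
    return parent

# A simulated user who always prefers the genome closest to a fixed taste.
TASTE = [0.5, 0.5, 0.5]

def distance(genome):
    return sum((g - t) ** 2 for g, t in zip(genome, TASTE))

def simulated_user(candidates):
    return min(range(len(candidates)), key=lambda i: distance(candidates[i]))

result = interactive_evolution(simulated_user)
```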

References:

[1] Y. Li, C. Hu and X. Yao, "Innovative Batik Design with an Interactive Evolutionary Art System," Journal of Computer Science and Technology, 24(6):1035-1047, November 2009. http://www.cs.bham.ac.uk/~xin/papers/LiHuYaoJCST09.pdf

[2] J. Romero and P. Machado, The Art of Artificial Evolution: A Handbook on Evolutionary Art and Music, Springer Netherlands, 2008.

Notes:

(i) This project is suitable for students who have taken the Nature Inspired Optimisation module.

(ii) Programming skills are essential.


Professor Xin Yao
School of Computer Science
The University of Birmingham
Edgbaston, Birmingham B15 2TT
U.K.
Phone: +44 121 414 3747
Fax: +44 121 414 4281
Email: x.yao@cs.bham.ac.uk