Relational clustering for knowledge discovery in life sciences

Created by W.Langdon from gp-bibliography.bib Revision:1.4221

  author =       "Ilaria Giordani",
  title =        "Relational clustering for knowledge discovery in life
                 sciences",
  school =       "Universita degli Studi di Milano-Bicocca",
  year =         "2009",
  address =      "Italy",
  month =        oct,
  keywords =     "genetic algorithms, genetic programming, Relational
                 Clustering, Feature Selection, Knowledge integration,
                 Mixed data types",
  URL =          "",
  language =     "eng",
  size =         "144 pages",
  abstract =     "Clustering is one of the most common machine learning
                 techniques, and it has been widely applied in genomics,
                 proteomics and, more generally, in the life sciences. In
                 particular, clustering is an unsupervised technique
                 that, based on geometric concepts like distance or
                 similarity, partitions objects into groups, such that
                 objects with similar characteristics are clustered
                 together and dissimilar objects are in different
                 clusters. In many domains where clustering is applied,
                 some background knowledge is available in different
                 forms: labelled data (specifying the category to which
                 an instance belongs); complementary information about
                 'true' similarity between pairs of objects or about the
                 relationship structure present in the input data; user
                 preferences (for example specifying whether two
                 instances should be in the same or different clusters). In
                 particular, in many real-world applications like
                 biological data processing, social network analysis and
                 text mining, data do not exist in isolation: a rich
                 structure of relationships exists between them. A
                 simple example comes from the biological domain,
                 where there are many relationships between genes
                 and proteins based on many experimental conditions.
                 Another, perhaps more common, example is the Web search
                 domain, where there are relations between documents and
                 words in a text, or between web pages, search queries
                 and web users.
                 Our research is focused on how this background
                 knowledge can be incorporated into traditional
                 clustering algorithms to optimise the process of
                 pattern discovery (clustering) between instances.",
  abstract =     "We provide an overview of traditional clustering methods
                 with some important distance measures and then we
                 analyse three particular challenges that we try to
                 overcome with different proposed methods: 'feature
                 selection' to reduce a high-dimensional input space and
                 remove noise from data; 'mixed data types' to handle
                 both numeric and categorical values in the clustering
                 procedure, as is typical of life science applications;
                 finally, 'knowledge integration' to improve the
                 semantic value of clustering by incorporating
                 background knowledge. Regarding the first challenge we
                 propose a novel approach based on genetic
                 programming, an evolutionary algorithm-based
                 methodology, to perform feature selection
                 automatically. Different clustering algorithms have been
                 investigated for the second challenge. A modified
                 version of a particular algorithm is proposed and
                 applied to clinical data. Particular attention is
                 given to the final challenge, the most important
                 objective of this thesis: the development of a new
                 relational clustering framework that improves the
                 semantic value of clustering by taking into account
                 relationships learnt from
                 background knowledge. We investigate and classify
                 existing clustering methods into two principal
                 categories: (1) structure-driven approaches, which are
                 bound to the data structure: the clustering problem is
                 tackled along several dimensions, clustering the
                 columns and rows of a given dataset concurrently, as in
                 biclustering algorithms or vertical 3-D clustering; (2)
                 knowledge-driven approaches, in which domain information
                 is used to drive the clustering process and interpret
                 its results: semi-supervised clustering, which uses
                 both labelled and unlabelled data, has attracted
                 significant attention. This kind of clustering
                 algorithm represents the first step towards implementing
                 the proposed general framework, which is itself
                 classified in this category. In particular, the thesis
                 focuses on the
                 development of a general framework for relational
                 clustering, instantiating it for three different life
                 science applications. The first aims to find groups of
                 genes with similar behaviour with respect
                 to their expression and regulatory profiles. The second
                 is a pharmacogenomics application, in which the
                 relational clustering framework is applied to a
                 benchmark dataset (NCI60) to identify a drug treatment
                 for a given cell line based on both drug activity
                 pattern and gene expression profile. Finally, the
                 proposed framework is applied to clinical data: a
                 dataset containing different information
                 about patients in anticoagulant therapy has been
                 analysed to find groups of patients with similar
                 behaviour and responses to the therapy.",
  notes =        "NCI60, Saccharomyces Genome Database, Oral
                 anticoagulation therapy Also known as
