The 1st International Workshop on High Dimensional Data Mining (HDM)
In conjunction with the IEEE International Conference on Data Mining (IEEE ICDM 2013) in Dallas, Texas.
Schedule
8:45 - Opening: Welcome & Introduction by Ata Kaban
9:00 - Invited Talk: Jo Etzel, "High dimensional data case study: fMRI MVPA"
10:00 - coffee break
10:30 - Invited Talk: Bob Durrant, "The Unreasonable Effectiveness of Random Projections in Computer Science"
12:00 - lunch break
13:30 - Invited Talk: Stephan Gunnemann, "Advanced Subspace Clustering Techniques"
14:30 - Peng Jiang and Michael T. Heath, "Pattern Discovery in High Dimensional Binary Data"
15:00 - coffee break
15:30 - Michael E. Houle, "Dimensionality, Discriminability, Density & Distance Distributions"
16:00 - Arthur Flexer and Dominik Schnitzer, "Can Shared Nearest Neighbors Reduce Hubness in High-Dimensional Spaces?"
16:30 - Jiangbo Yuan and Xiuwen Liu, "Transform Residual K-means Trees for Scalable Clustering"
17:00 - Ata Kaban, "A New Look at Compressed Ordinary Least Squares"
17:30 - 18:00 Discussion & Closing
Invited Speakers
Title: "The
Unreasonable Effectiveness of Random Projections in Computer Science"
Abstract:
Random projections have been used as a dimensionality reduction technique for large data since the appearance of Arriaga and Vempala's seminal FOCS 1999 paper, and they continue to find applications both within the field of Machine Learning and elsewhere. Starting with some motivating examples from Machine Learning and Data Mining, this talk will review some key theoretical properties of random projections and the practical applications this theory has inspired. In particular, we will cover the Johnson-Lindenstrauss lemma, which gives conditions under which random projections approximately preserve the geometric properties of data, together with related applications; discuss the field of Compressed Sensing, from which one can derive guarantees for techniques working with sparse data, illustrated in the setting of SVM; and discuss more recent theory giving guarantees for linear classifiers and regressors working with non-sparse data. Finally, we consider the use of random projections as an efficient method for regularizing rank-deficient covariance and precision matrix estimates, and describe two novel applications of this approach to classification and unconstrained large-scale continuous optimization.
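To make the Johnson-Lindenstrauss property concrete, here is a minimal Python sketch (not taken from the talk; the data are synthetic and all sizes are arbitrary illustrative choices) that projects high dimensional points through a scaled random Gaussian matrix and checks how well pairwise distances are preserved:

```python
# Minimal sketch: a scaled random Gaussian projection approximately
# preserves pairwise Euclidean distances (Johnson-Lindenstrauss).
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 10000, 500                      # n points, ambient dim d, projected dim k

X = rng.standard_normal((n, d))                # synthetic high dimensional data
R = rng.standard_normal((d, k)) / np.sqrt(k)   # random projection matrix
Y = X @ R                                      # projected data

def pairwise_distances(Z):
    """Pairwise Euclidean distances between the rows of Z."""
    sq = (Z ** 2).sum(axis=1)
    return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * Z @ Z.T, 0.0))

mask = ~np.eye(n, dtype=bool)                  # exclude zero self-distances
ratios = pairwise_distances(Y)[mask] / pairwise_distances(X)[mask]
print(f"distance ratios: min={ratios.min():.3f}, max={ratios.max():.3f}")
# The ratios concentrate around 1: the geometry survives a 20-fold
# reduction in dimensionality.
```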
Short Bio:
Bob Durrant is a lecturer in the Department of Statistics at the University of Waikato, New Zealand, where he is also a member of the Machine Learning group. He received his BSc degree in Mathematics from the Open University UK, and his MSc and PhD degrees in Computer Science from the University of Birmingham, supervised by Ata Kaban. He defended his PhD thesis in April 2013; his examiners were John Shawe-Taylor and Peter Tino.
Bob's recent research has mostly focused on the effects of random projections on classifier performance, where his main contributions have been theoretical results on linear discriminants and kernel linear discriminants published at KDD 2010, ICPR 2010 (prize-winning paper: Best Student Paper in the Pattern Recognition and Machine Learning track), Pattern Recognition Letters 2011 (invited paper), AISTATS 2012, and ICML 2013. He has a side interest in optimization problems, and his novel application of random projections to large-scale continuous optimization won the Best Paper award in the EDA track at GECCO 2013. An invited extension of this work is currently under review for the MIT Press journal Evolutionary Computation.
Bob's current research focuses on explaining the utility of classifier ensembles in settings where the number of data features is much greater than the number of training examples; preliminary work in this direction won the Best Paper award at ACML 2013.
Title: "High dimensional data case study: fMRI MVPA"
Abstract:
fMRI (functional magnetic resonance imaging) studies present researchers with many "interesting" aspects: datasets typically include tens of thousands of features, measured at dozens to thousands of time points, but often only a few tens of examples. Additionally, the data has an incredibly complex structure, including multiple spatial and temporal dependencies, on top of the usual complications involved in studies of human behavior. Nevertheless, machine learning and data mining techniques have successfully extracted information from these datasets, work often called "brain decoding", "MVPA" (multi-voxel pattern analysis), or even "mind reading".
In this talk I will introduce fMRI data from an analyst's viewpoint, with the goal of providing a foundation for understanding (and hopefully becoming excited by) this particular application. I will describe actual datasets and typical approaches, and outline open issues for the field, including below-chance accuracy and significance testing.
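As a hint at why significance testing is nontrivial in this regime, here is a minimal Python sketch (my own illustration, not from the talk; the synthetic data, the nearest-centroid classifier, and all sizes are assumptions) of permutation testing, a standard way to assess whether a decoding accuracy is above chance:

```python
# Minimal sketch: permutation test for classification accuracy in the
# typical fMRI regime of few examples and many features (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 5000                                      # few examples, many features
y = np.repeat([0, 1], n // 2)                        # two experimental conditions
X = rng.standard_normal((n, d)) + 0.05 * y[:, None]  # weak class signal

def loo_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier."""
    hits = 0
    for i in range(len(y)):
        train = np.arange(len(y)) != i
        c0 = X[train & (y == 0)].mean(axis=0)
        c1 = X[train & (y == 1)].mean(axis=0)
        pred = int(np.linalg.norm(X[i] - c1) < np.linalg.norm(X[i] - c0))
        hits += (pred == y[i])
    return hits / len(y)

observed = loo_accuracy(X, y)
null = [loo_accuracy(X, rng.permutation(y)) for _ in range(200)]
p = (1 + sum(a >= observed for a in null)) / (1 + len(null))
print(f"accuracy = {observed:.2f}, permutation p-value = {p:.3f}")
```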
Short Bio:
Jo Etzel completed a PhD in Bioinformatics and Computational Biology at Iowa State University under the supervision of Julie Dickerson and Ralph Adolphs, then a postdoc under Christian Keysers at the Social Brain Lab, University Medical Center Groningen (The Netherlands). Since 2010 Jo has worked as a Research Analyst in the Psychology Department at Washington University in St. Louis, primarily with the groups of Todd Braver, Jeff Zacks, and Deanna Barch. Her research interests are focused on methodology, particularly multivariate analyses of fMRI data, but also nonparametric statistics and psychophysiological measures. Jo blogs about fMRI analysis at mvpa.blogspot.com.
Title: "Advanced
Subspace Clustering Techniques"
Abstract:
Clustering is one of the core data mining tasks; it aims at grouping similar objects while separating dissimilar ones. Since today's applications usually record many characteristics for each object, one cannot expect to find similar objects by considering all attributes together. As a general solution to this problem, the paradigm of subspace clustering has been introduced. It aims at detecting clusters hidden in locally relevant subspace projections of the data.
In this talk, I will discuss novel methods for effective subspace clustering that tackle important challenges such as redundancy handling and the task of detecting multiple overlapping clusterings. Besides presenting methods for the case of vector data, I will introduce techniques that extend the subspace clustering paradigm to the domain of network data.
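As a toy illustration of the underlying motivation (this is not one of the talk's methods; the data generation and sizes are invented), the following Python sketch builds two clusters that are well separated in a 2-dimensional subspace but invisible when all 100 attributes are considered:

```python
# Minimal sketch: cluster structure visible only in a locally relevant
# 2-dimensional subspace, hidden by 98 irrelevant noisy attributes.
import numpy as np

rng = np.random.default_rng(0)
n = 200
labels = rng.integers(0, 2, n)                                # two hidden clusters
relevant = 5 * labels[:, None] + rng.standard_normal((n, 2))  # separated dims
noise = 5 * rng.standard_normal((n, 98))                      # irrelevant attributes
X = np.hstack([relevant, noise])

def separation(Z):
    """Distance between cluster centroids relative to within-cluster spread."""
    a, b = Z[labels == 0], Z[labels == 1]
    within = np.linalg.norm(a - a.mean(axis=0), axis=1).mean()
    return np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)) / within

print("separation, all attributes:   ", round(separation(X), 2))
print("separation, relevant subspace:", round(separation(X[:, :2]), 2))
# The clusters are well separated only in the relevant subspace.
```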
Short Bio:
Stephan Gunnemann is a postdoctoral researcher at the Department of Computer Science, Carnegie Mellon University, USA. From 2008 to 2012, he was a research associate in the data management and data exploration group at RWTH Aachen University, Germany. Dr. Gunnemann received his PhD in Computer Science from RWTH Aachen University in 2012. His research interests include graph mining and the mining of non-redundant, multiple clustering solutions for high dimensional and relational data.
Description of Workshop
Some 13 years ago, Stanford statistician D. Donoho predicted that the 21st century would be the century of data:
"We can say with complete confidence that in the coming century, high-dimensional data analysis will be a very significant activity, and completely new methods of high-dimensional data analysis will be developed; we just don't know what they are yet." -- D. Donoho, 2000.
Indeed, unprecedented technological advances are producing increasingly high dimensional data sets in all areas of science, engineering, and business. These include genomics and proteomics, biomedical imaging, signal processing, astrophysics, finance, and web and market basket analysis, among many others. The number of features in such data is often of the order of thousands or millions, much larger than the available sample size. This renders classical data analysis methods inadequate, questionable, or at best inefficient, and calls for new approaches.
Some of the manifestations of this curse of dimensionality are the following:
- High dimensional geometry defeats our intuition, which is rooted in low dimensional experience, so that data presentation and visualisation become particularly challenging.
- Distance concentration is the phenomenon, in high dimensional probability spaces, whereby the contrast between pairwise distances vanishes as the dimensionality increases. This makes distances meaningless and affects all methods that rely on a notion of distance (see the sketch after this list).
- Bogus correlations and misleading estimates may result when trying to fit complex models whose effective dimensionality is too large compared to the number of available data points.
- The accumulation of noise may confound our ability to find the low dimensional intrinsic structure hidden in the high dimensional data.
- The computational cost of processing high dimensional data is often prohibitive.
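Distance concentration in particular is easy to reproduce empirically; the sketch below (illustrative only; the sample size and the sequence of dimensions are arbitrary choices) measures the relative contrast between the farthest and nearest neighbour of a query point as the dimensionality grows:

```python
# Minimal sketch of distance concentration: the relative contrast between
# the nearest and farthest neighbour shrinks as the dimension grows.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000, 10000):
    X = rng.random((500, d))                      # 500 points uniform in [0,1]^d
    dists = np.linalg.norm(X[1:] - X[0], axis=1)  # distances from a query point
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d = {d:5d}   relative contrast = {contrast:.3f}")
# The printed contrast decreases towards 0 with growing d, so nearest and
# farthest neighbours become nearly indistinguishable.
```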
Topics
This workshop aims to promote new advances and research directions to address the curses, and to uncover and exploit the blessings, of high dimensionality in data mining. Topics of interest range from theoretical foundations, to algorithms and implementation, to applications and empirical studies of mining high dimensional data, including (but not limited to) the following:
- Systematic studies of how the curse of dimensionality affects data mining methods
- New data mining techniques that exploit some properties of high dimensional data spaces
- Theoretical underpinnings of mining data whose dimensionality is larger than the sample size
- Stability and reliability analyses for data mining in high dimensions
- Adaptive and non-adaptive dimensionality reduction for noisy high dimensional data sets
- Methods of random projections, compressed sensing, and random matrix theory applied to high dimensional data mining
- Models of low intrinsic dimension, such as sparse representation, manifold models, latent structure models, and studies of their noise tolerance
- Classification, regression, and clustering of high dimensional complex data sets
- Functional data mining
- Data presentation and visualisation methods for very high dimensional data sets
- Data mining applications to real problems in science, engineering or business where the data is high dimensional
Paper submission
High quality original submissions are solicited for oral and poster presentation at the workshop. Papers must not exceed 8 pages and must follow the IEEE ICDM format requirements of the main conference. All submissions will be peer-reviewed, and all accepted workshop papers will be published in the proceedings by the IEEE Computer Society Press. Submit your paper here.
Important dates
Submission deadline: August 17, 2013
Notifications to authors: September 24, 2013
Workshop: December 7, 2013
Programme committee
Adam Kowalczyk - Victoria Research Laboratory, NICTA, Australia
Arthur Zimek - LMU Munich, Germany
Barbara Hammer - Clausthal University of Technology, Germany
Ata Kaban - University of Birmingham, UK
John A. Lee - Universite Catholique de Louvain, Belgium
Laurens van der Maaten - Delft University of Technology, The Netherlands
Mark Last - Ben-Gurion University of the Negev, Israel
Milos Radovanovic - University of Novi Sad, Serbia
Pierre Alquier - University College Dublin, Ireland
Robert J. Durrant - University of Waikato, New Zealand
Stephan Gunnemann - Carnegie Mellon University, USA
Yiming Ying - University of Exeter, UK
Workshop organiser
Ata Kaban, School of Computer Science, University of Birmingham, UK