Self-Adaptive Anytime Stream Clustering
Philipp Kranen*, Ira Assent†, Corinna Baldauf*, Thomas Seidl*
*RWTH Aachen University, Germany
†Aalborg University, Denmark
{kranen, baldauf, seidl}@cs.rwth-aachen.de, ira@cs.aau.dk
Abstract—Clustering streaming data requires algorithms
which are capable of updating clustering results for the incom-
ing data. As data is constantly arriving, time for processing is
limited. Clustering has to be performed in a single pass over
the incoming data and within the possibly varying inter-arrival
times of the stream. Likewise, memory is limited, making
it impossible to store all data. For clustering, we are faced
with the challenge of maintaining a current result that can be
presented to the user at any given time.
In this work, we propose a parameter-free algorithm that
automatically adapts to the speed of the data stream. It makes
best use of the time available under the current constraints
to provide a clustering of the objects seen up to that point.
Our approach incorporates the age of the objects to reflect
the greater importance of more recent data. Moreover, we
are capable of detecting concept drift, novelty and outliers in
the stream. For efficient and effective handling, we introduce
the ClusTree, a compact and self-adaptive index structure for
maintaining stream summaries. Our experiments show that
our approach is capable of handling a multitude of different
stream characteristics for accurate and scalable anytime stream
clustering.
Keywords: stream clustering, anytime algorithms, self-adaptive algorithms
I. INTRODUCTION
Analysis of streaming data is gaining importance as
sensors or other data gathering devices are widely deployed.
Streams constitute data values or tuples that need to be
processed as they arrive. With the wide applicability of
streaming data, clustering of streaming data has recently
received much attention in data mining research. The goal
is to cluster the objects within the stream continuously, such
that there is always an up-to-date clustering of all objects
seen so far. As opposed to clustering of a fixed data set
that is available entirely prior to the data mining analysis,
clustering of streaming data poses additional challenges.
Single pass clustering. In streaming environments, data
arrives continuously. This means that clustering streams has
to be performed in a single pass over the data in an online
fashion.
Limited memory. Since data streams are assumed to be
endless, storing each arriving object is simply not feasible.
Any streaming clustering model has to adhere to memory
constraints.
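The single-pass and limited-memory requirements can be illustrated with a constant-size summary that is updated once per arriving object. The following is a minimal sketch, assuming a summary made of a count, a per-dimension linear sum, and a per-dimension squared sum; the class name `ClusterFeature` and its methods are hypothetical and are not taken from this paper.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class ClusterFeature:
    """Constant-size summary of d-dimensional points (hypothetical helper)."""
    dim: int
    n: int = 0
    linear_sum: List[float] = field(init=False)
    squared_sum: List[float] = field(init=False)

    def __post_init__(self) -> None:
        # Memory use depends only on the dimensionality, not on the stream length.
        self.linear_sum = [0.0] * self.dim
        self.squared_sum = [0.0] * self.dim

    def insert(self, point: Sequence[float]) -> None:
        # Each object is seen exactly once and then discarded.
        self.n += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x
            self.squared_sum[i] += x * x

    def mean(self) -> List[float]:
        return [s / self.n for s in self.linear_sum]

    def variance(self) -> List[float]:
        # Per-dimension variance E[x^2] - E[x]^2, clamped at zero against rounding.
        return [
            max(ss / self.n - (ls / self.n) ** 2, 0.0)
            for ls, ss in zip(self.linear_sum, self.squared_sum)
        ]


# Usage: summarize three points without storing any of them.
cf = ClusterFeature(dim=2)
for p in [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]:
    cf.insert(p)
print(cf.n, cf.mean(), cf.variance())
```

Because the sums are additive, two such summaries could be merged by adding their fields, which is one reason summaries of this kind suit streaming settings.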
Limited time. The algorithm has to be able to keep up
with the speed of the data stream. On average, clustering
an object cannot take longer than the time between two
consecutive arrivals; otherwise the model falls behind the
stream and is no longer current.
Varying time allowances. Many streams are bursty rather
than showing a constant flow of data. This means that the
time available to process any item in the stream may vary
greatly. Examples include peak times for customer
transactions or seasonal changes in consumer behavior.
Existing stream clustering algorithms cannot handle such
varying time allowances unless they resort to the minimal
time allowance in the stream. Clearly, this means
downgrading to the worst-case assumption.
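One generic way to exploit a varying time allowance is to organize summaries hierarchically and refine the placement of an object only as far as the current budget permits. The sketch below is an illustration under stated assumptions, not the algorithm of this paper: the `Node` class and the `anytime_descend` function are hypothetical, and a step counter stands in for the wall-clock time until the next arrival.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class Node:
    """Hypothetical hierarchy node: a centroid plus finer-grained children."""
    centroid: Sequence[float]
    children: List["Node"] = field(default_factory=list)


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def anytime_descend(root: Node, point: Sequence[float],
                    budget_steps: int) -> Tuple[Node, int]:
    """Descend toward the closest leaf; stop early when the budget runs out.

    Returns the node reached and its depth. A small budget yields a coarse
    placement, a larger budget a finer one.
    """
    node, depth = root, 0
    while node.children and budget_steps > 0:
        node = min(node.children, key=lambda c: sq_dist(c.centroid, point))
        depth += 1
        budget_steps -= 1
    return node, depth


# Usage: the same object is placed at different granularities
# depending on how many refinement steps the stream speed allows.
left = Node((-5.0, 0.0), [Node((-6.0, 0.0)), Node((-4.0, 0.0))])
right = Node((5.0, 0.0), [Node((4.0, 0.0)), Node((6.0, 0.0))])
root = Node((0.0, 0.0), [left, right])

for budget in (0, 1, 5):
    node, depth = anytime_descend(root, (5.8, 0.0), budget)
    print(budget, depth, node.centroid)
```

A fixed worst-case budget would correspond to always calling the function with the smallest value; passing the budget per object lets slower phases of the stream produce finer placements.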
Evolving data. It is important to take into account that the
model underlying the data in the stream may change over
time. For example, consumption patterns during holidays
may differ from those that are seen the rest of the year.
To capture such phenomena, stream clustering should be
capable of clearly identifying such changes. Changes in
existing clusters, denoted as concept drift, should be
reported separately. Likewise, newly created clusters,
so-called novelty, and outliers should be detected as such.
Flexible number and size of clusters. Many clustering
algorithms, e.g. from the family of partitioning algorithms,
require parametrization of the number of clusters to be
detected. While setting such a parameter is also difficult
in traditional clustering, streams undergo changes that may
cause clusters to emerge, disappear, merge, or split. As such,
setting a fixed number of clusters for the stream would
distort the model. Existing stream clustering algorithms have
to fix the size of their model in advance, e.g. through a
maximum number of micro clusters [1], even though such
knowledge is usually not available a priori.
We propose ClusTree, a parameter-free stream clustering
algorithm that is capable of processing the stream in a single
pass, with limited memory usage. It always maintains an
up-to-date cluster model, and reports concept drift, novelty,
and outliers. For handling of varying time allowances,
we propose an anytime clustering approach. Anytime
algorithms denote approaches that are capable of delivering
a result at any given point in time, and of using more time if
it is available to refine the result. For clustering, this means
that our algorithm is capable of processing even very fast
streams, but also of using greater time allowances to refine
2009 Ninth IEEE International Conference on Data Mining
1550-4786/09 $26.00 © 2009 IEEE
DOI 10.1109/ICDM.2009.47