Kernel-Density-Based Clustering of Time Series Subsequences Using a
Continuous Random-Walk Noise Model
Anne Denton
Department of Computer Science
North Dakota State University
Fargo, North Dakota 58105-5164, USA
anne.denton@ndsu.edu
Abstract
Noise levels in time series subsequence data are typically very high, and properties of the noise differ from those of white noise. The proposed algorithm incorporates a continuous random-walk noise model into kernel-density-based clustering. Evaluation is done by testing to what extent the resulting clusters are predictive of the process that generated the time series. It is shown that the new algorithm not only outperforms partitioning techniques, which lead to trivial and unsatisfactory results under the given quality measure, but also improves upon other density-based algorithms. The results suggest that the noise elimination properties of kernel-density-based clustering algorithms can be of significant value for the use of clustering in the preprocessing of data.
1. Introduction
Finding patterns in time series subsequence data is a notoriously difficult problem. Standard clustering techniques, such as k-means and hierarchical clustering, result in clusters that are largely independent of the time series from which they originate [13]. Kernel-density-based clustering can lead to meaningful results, especially if an appropriate noise model is chosen [6]. Noise elimination in kernel-density-based clustering is based on the concept of a noise threshold, below which maxima in the density distribution are not considered as cluster centers [10]. Most time series data follow a noise distribution that differs from standard assumptions on randomness, and it can be beneficial to incorporate a more accurate noise distribution. While previous work [6] assumed a discrete random-walk model, the current paper is based on a continuous model that is much more realistic in most settings.
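As a minimal sketch of the thresholding idea (hypothetical helper names; not the paper's implementation), subsequences can be scored with a Gaussian kernel density estimate, and the noise threshold can be set from the densities that windows of a continuous random walk (Gaussian steps) attain in the same space; subsequences below that level are treated as noise rather than cluster material.

```python
import numpy as np

rng = np.random.default_rng(0)

def subsequences(series, w):
    """Sliding windows of length w, each shifted to zero mean."""
    subs = np.lib.stride_tricks.sliding_window_view(series, w).astype(float)
    return subs - subs.mean(axis=1, keepdims=True)

def kernel_density(points, query, h):
    """Gaussian kernel density estimate at each query point."""
    d2 = ((query[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * h * h)).mean(axis=1)

w, h = 8, 1.0
signal = np.sin(np.linspace(0, 20 * np.pi, 500))   # structured series
walk = np.cumsum(rng.normal(0, 0.3, 500))          # continuous random walk
subs = subsequences(signal, w)
noise_subs = subsequences(walk, w)

dens = kernel_density(subs, subs, h)
# Noise threshold (illustrative choice): the density level that random-walk
# windows reach in the space of the clustered subsequences.
threshold = np.quantile(kernel_density(subs, noise_subs, h), 0.95)
is_noise = dens < threshold
print(f"fraction flagged as noise: {is_noise.mean():.2f}")
```

The quantile-based cutoff is one plausible way to realize a noise threshold; the paper's actual threshold is derived from the continuous random-walk model itself.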
Time series clustering algorithms have been used directly as pattern extraction algorithms [15], and as a preprocessing step for further data mining [5]. Neither application relies on assigning a cluster to every subsequence. Noise elimination as part of the clustering process can be expected to become increasingly important in the preprocessing of data for classification. While attribute selection in classification has traditionally been performed on the basis of classification quality [14], this approach is vulnerable to the "curse of dimensionality" and does not scale well to a large number of attributes. It is, therefore, important to develop preprocessing techniques that can distinguish between noise and meaningful data without using the class label. In many current data mining problems, objects are characterized by diverse data that can include time series as well as other attributes. In this setting it is very important that attributes derived from a time series contribute information that benefits the classification process, and noise elimination in clustering gains new importance.
The current paper examines the time series subsequence clustering problem from a classification perspective, where the class label is the correct identification of the entire clustering rather than of any individual cluster. Quality of pattern extraction is evaluated by testing to what extent the assignment to clusters can be used to identify the type of time series from which the clusters were derived. K-means and other partitioning methods produce a trivial and very poor result under this measure because a subsequence is guaranteed to be assigned to some cluster even if the clustering was produced from a different time series. In kernel-density-based clustering, subsequences may be identified as noise and will then not be assigned to the respective time series. It will be shown that the ratio of correctly to incorrectly assigned subsequences can become very large for some models that discard a substantial number of subsequences as noise.
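The evaluation measure can be sketched as follows (a toy setup with hypothetical names, not the paper's experiments): representatives are taken from one series, a test subsequence is assigned only if it falls within a cutoff distance of some representative (otherwise it is discarded as noise), and the quantities of interest are the counts of same-series versus other-series subsequences that end up assigned.

```python
import numpy as np

rng = np.random.default_rng(1)

def windows(series, w=8):
    """Sliding windows of length w, each shifted to zero mean."""
    s = np.lib.stride_tricks.sliding_window_view(series, w).astype(float)
    return s - s.mean(axis=1, keepdims=True)

# Two generating processes: a sine wave and a continuous random walk.
sine = np.sin(np.linspace(0, 40 * np.pi, 600))
walk = np.cumsum(rng.normal(0, 0.3, 600))

# Stand-in "cluster centers" from the sine series (a real run would use
# density maxima above the noise threshold, not evenly spaced windows).
centers = windows(sine)[::50]

def assign(subs, centers, cutoff):
    """True where a window is within cutoff of some center; False = noise."""
    d = np.linalg.norm(subs[:, None, :] - centers[None, :, :], axis=2).min(axis=1)
    return d < cutoff

cutoff = 0.5
correct = assign(windows(sine), centers, cutoff).sum()    # same-series assigned
incorrect = assign(windows(walk), centers, cutoff).sum()  # other-series assigned
print(correct, incorrect)
```

A partitioning method corresponds to an infinite cutoff: every window from either series is assigned, so the ratio of correct to incorrect assignments stays near the trivial baseline, whereas a finite noise cutoff can reject most windows from the foreign series.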
The paper is organized as follows. Section 2 introduces
Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM’05)
1550-4786/05 $20.00 © 2005 IEEE