Kernel-Density-Based Clustering of Time Series Subsequences Using a Continuous Random-Walk Noise Model

Anne Denton
Department of Computer Science
North Dakota State University
Fargo, North Dakota 58105-5164, USA
anne.denton@ndsu.edu

Abstract

Noise levels in time series subsequence data are typically very high, and properties of the noise differ from those of white noise. The proposed algorithm incorporates a continuous random-walk noise model into kernel-density-based clustering. Evaluation is done by testing to what extent the resulting clusters are predictive of the process that generated the time series. It is shown that the new algorithm not only outperforms partitioning techniques, which lead to trivial and unsatisfactory results under the given quality measure, but also improves upon other density-based algorithms. The results suggest that the noise-elimination properties of kernel-density-based clustering algorithms can be of significant value for the use of clustering in the preprocessing of data.

1. Introduction

Finding patterns in time series subsequence data is a notoriously difficult problem. Standard clustering techniques, such as k-means and hierarchical clustering, result in clusters that are largely independent of the time series from which they originate [13]. Kernel-density-based clustering can lead to meaningful results, especially if an appropriate noise model is chosen [6]. Noise elimination in kernel-density-based clustering is based on the concept of a noise threshold, below which maxima in the density distribution are not considered as cluster centers [10]. Most time series data follow a noise distribution that differs from standard assumptions on randomness, and it can be beneficial to incorporate a more accurate noise distribution. While previous work [6] assumed a discrete random-walk model, the current paper is based on a continuous model that is much more realistic in most settings.
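The noise-threshold idea can be illustrated with a minimal sketch. The following is not the paper's algorithm: the Gaussian kernel, the bandwidth `h`, the densest-first assignment rule, and the `2 * h` linking radius are all illustrative assumptions. It shows only the core mechanism, namely that points whose estimated density falls below a noise threshold are never promoted to cluster centers and remain unassigned.

```python
import numpy as np

def gaussian_density(points, h):
    # Kernel density estimate at every point (Gaussian kernel, bandwidth h).
    diff = points[:, None, :] - points[None, :, :]
    return np.exp(-0.5 * (diff ** 2).sum(-1) / h ** 2).mean(axis=1)

def density_cluster(points, h, noise_threshold):
    """Visit points densest-first: a point below the noise threshold is
    labeled -1 (noise); a point close to an already-labeled (and hence
    denser) point joins its cluster; any other point is a density maximum
    above the threshold and starts a new cluster."""
    dens = gaussian_density(points, h)
    labels = np.full(len(points), -1)
    n_clusters = 0
    for i in np.argsort(-dens):
        if dens[i] < noise_threshold:
            continue                        # below threshold: noise
        labeled = np.flatnonzero(labels >= 0)
        if len(labeled) > 0:
            d = np.linalg.norm(points[labeled] - points[i], axis=1)
            if d.min() < 2 * h:             # near an existing cluster
                labels[i] = labels[labeled[np.argmin(d)]]
                continue
        labels[i] = n_clusters              # new density maximum
        n_clusters += 1
    return labels, dens
```

On two dense groups plus one isolated point, the groups receive distinct labels while the isolated point, whose density estimate stays below the threshold, is returned as noise.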
Time series clustering algorithms have been used directly as pattern extraction algorithms [15], and as a preprocessing step for further data mining [5]. Neither application relies on assigning a cluster to all subsequences. It can be expected that noise elimination as part of a clustering process will become increasingly important in the preprocessing of data for classification. While attribute selection in classification has traditionally been performed on the basis of classification quality [14], this approach is vulnerable to the "curse of dimensionality" and does not scale well to a large number of attributes. It is, therefore, important to develop preprocessing techniques that are able to distinguish between noise and meaningful data without using the class label. In many current data mining problems, objects are characterized by diverse data that can include time series as well as other attributes. In this setting it is very important that attributes derived from a time series contribute information that is beneficial to a classification process, and noise elimination in clustering gains new importance.

The current paper examines the time series subsequence clustering problem from a classification perspective, where the class label is the correct identification of the entire clustering rather than any individual cluster. Quality of pattern extraction is evaluated by testing to what extent the assignment to clusters can be used to identify the type of time series from which the clusters were derived. K-means and other partitioning methods produce a trivial and very poor result under this measure because a subsequence is guaranteed to be assigned to some cluster even if the clustering was produced based on a different time series. In kernel-density-based clustering, subsequences may be identified as noise, and will then not be assigned to the respective time series.
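The evaluation perspective can be sketched as follows. This is a hypothetical illustration, not the paper's experimental protocol: the nearest-center assignment rule, the `radius` rejection cutoff, and the toy cluster centers are all assumptions. The point it makes is the one above: an evaluation that counts how many subsequences each candidate clustering accepts is only informative if the clustering can reject subsequences as noise, since a partitioning method would accept everything.

```python
import numpy as np

def sliding_windows(series, w):
    # All length-w subsequences of the series (sliding window, step 1).
    return np.array([series[i:i + w] for i in range(len(series) - w + 1)])

def assign_or_reject(subseq, centers, radius):
    # Assign the subsequence to the nearest cluster center, or
    # reject it as noise (-1) if no center lies within `radius`.
    d = np.linalg.norm(centers - subseq, axis=1)
    return int(np.argmin(d)) if d.min() < radius else -1

def identify_source(test_series, centers_by_class, w, radius):
    """For each candidate clustering (one per generating process), count
    how many subsequences of the test series it accepts rather than
    rejects as noise, and identify the process with the most accepts."""
    subs = sliding_windows(test_series, w)
    counts = {cls: sum(assign_or_reject(s, centers, radius) >= 0
                       for s in subs)
              for cls, centers in centers_by_class.items()}
    return max(counts, key=counts.get), counts
```

With centers derived from the matching process close to the test subsequences and centers from a different process far away, the matching clustering accepts the subsequences while the mismatched one rejects them all, which is the ratio-based behavior discussed below.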
It will be shown that the ratio of correctly to incorrectly assigned subsequences can become very large for some models that discard a substantial number of subsequences as noise.

The paper is organized as follows. Section 2 introduces

Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM'05) 1550-4786/05 $20.00 © 2005 IEEE