Efficient Similarity Search for Time Series Data Based on the Minimum Distance ⋆

Sangjun Lee, Dongseop Kwon, and Sukho Lee

School of Electrical Engineering and Computer Science, Seoul National University
Seoul 151-742, Korea
{freude,subby}@db.snu.ac.kr shlee@cse.snu.ac.kr

Abstract. We address the problem of efficient similarity search based on the minimum distance in large time series databases. Most previous work has focused on similarity matching and retrieval of time series based on the Euclidean distance. However, as we demonstrate in this paper, the Euclidean distance has limitations as a similarity measurement. It is sensitive to the absolute offsets of time sequences, so two time sequences that have similar shapes but different vertical positions may be classified as dissimilar. The minimum distance is a more suitable similarity measurement than the Euclidean distance in many applications where the shape of the time series is a major consideration. To support minimum distance queries, most previous work relies on a preprocessing step of vertical shifting that normalizes each time sequence by its mean before indexing. In this paper, we propose a novel and fast indexing scheme, called segmented mean variation indexing (SMV-indexing). Our indexing scheme can match time series of similar shapes without vertical shifting and guarantees no false dismissals. Several experiments are performed on real data (stock price movement) to measure the performance of our indexing scheme. The experiments show that SMV-indexing is more efficient than sequential scanning.

1 Introduction

Time sequences are of growing importance in many database applications, such as data mining and data warehousing [1,2]. A time sequence is a sequence of real numbers in which each number represents a value at a time point. Typical examples include stock price movements, exchange rates, weather data, and biomedical measurements.
Similarity search in time series databases is essential because it supports prediction and hypothesis testing in data mining and knowledge discovery [1,2]. Many techniques have been proposed to support the fast retrieval of similar time sequences based on the Euclidean distance [5,6,17]. However, the Euclidean distance as a similarity measurement has the following problem: it is sensitive to the absolute offsets of time sequences, so two time sequences that have similar shapes but different vertical positions may be classified as dissimilar.

⋆ This work was supported by the Brain Korea 21 Project in 2001.

A. Banks Pidduck et al. (Eds.): CAISE 2002, LNCS 2348, pp. 377–391, 2002.
© Springer-Verlag Berlin Heidelberg 2002
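The offset sensitivity described above can be illustrated with a small sketch. It assumes that the minimum distance is the Euclidean distance minimized over a vertical shift of one sequence, which for equal-length sequences is obtained by subtracting each sequence's mean before comparing. This reading is inferred from the abstract's description of mean normalization; the formal definition is not given in this part of the paper. The function names and the sample values are hypothetical and chosen only for illustration.

```python
import math


def euclidean_distance(x, y):
    """Plain Euclidean distance between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_shifted(x):
    """Vertically shift a sequence so that its mean becomes zero."""
    mean = sum(x) / len(x)
    return [a - mean for a in x]


def minimum_distance(x, y):
    """Assumed definition: Euclidean distance after removing each mean,
    i.e. the smallest Euclidean distance over all vertical shifts."""
    return euclidean_distance(mean_shifted(x), mean_shifted(y))


# Two hypothetical sequences with the same shape, offset by 100.
a = [10.0, 12.0, 11.0, 13.0]
b = [110.0, 112.0, 111.0, 113.0]

print(euclidean_distance(a, b))  # large: dominated by the vertical offset
print(minimum_distance(a, b))    # near zero: the shapes coincide
```

Under this assumption, the two sequences are far apart in Euclidean terms but essentially identical in shape, which is the case that the minimum distance is meant to capture.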