Combination of Similarity Measures for Time Series Classification using Genetic Algorithms

Deepti Dohare and V. Susheela Devi
Department of Computer Science and Automation
Indian Institute of Science, India
{deeptidohare, susheela}@csa.iisc.ernet.in

Abstract—Time series classification deals with the problem of classification of data that is multivariate in nature. This means that one or more of the attributes is in the form of a sequence. The notion of similarity or distance used in time series data is significant and affects the accuracy, time, and space complexity of the classification algorithm. There exist numerous similarity measures for time series data, but each of them has its own disadvantages. Instead of relying upon a single similarity measure, our aim is to find a near-optimal solution to the classification problem by combining different similarity measures. In this work, we use genetic algorithms to combine the similarity measures so as to get the best performance. The weights given to the different similarity measures evolve over a number of generations so as to get the best combination. We test our approach on a number of benchmark time series datasets and present promising results.

I. INTRODUCTION

Time series data are ubiquitous, as most data is in the form of time series, for example, stocks, annual rainfall, and blood pressure. In fact, other forms of data, including text, DNA, video, audio, and images, can also be meaningfully converted to time series [1]. It is also evident that there has been a strong interest in applying data mining techniques to time series data.

The problem of classification of time series data is an interesting problem in the field of data mining. The need to classify time series data arises in a broad range of real-world applications in medicine, science, finance, entertainment, and industry.
In cardiology, ECG signals (an example of time series data) are classified in order to see whether the data comes from a healthy person or from a patient suffering from heart disease [2]. In anomaly detection, users' system access activities on Unix systems are monitored to detect any kind of abnormal behavior [3]. In information retrieval, documents are classified into topic categories, a task that has been shown to be similar to time series classification [4]. Another example is the classification of signals coming either from nuclear explosions or from earthquakes, in order to monitor a nuclear test ban treaty [5].

Generally, a time series t = t_1, ..., t_r is an ordered set of r data points. Here the data points t_1, ..., t_r are typically measured at successive points in time spaced at uniform time intervals. A time series may also carry a class label. The problem of time series classification is to learn a classifier C, which is a function that maps a time series t to a class label l, that is, C(t) = l where l ∈ L, the set of class labels.

Time series classification methods can be divided into three large categories. The first is the distance based classification method, which requires a measure to compute the distance or similarity between pairs of time sequences [6]–[8]. The second is the feature based classification method, which transforms each time series into a feature vector and then applies a conventional classification method [9], [10]. The third is the model based classification method, where a model such as a Hidden Markov Model (HMM) or any other statistical model is used to classify time series data [11], [12].

In this paper, we consider the distance based classification method, where the choice of the similarity measure affects the accuracy as well as the time and space complexity of classification algorithms [6].
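To make the distance based setting concrete, the following Python sketch shows a classifier C(t) = l built from a pluggable distance function. It is only an illustration under stated assumptions, not the method evaluated in this paper: the nearest neighbor rule, the two example measures (Euclidean distance and dynamic time warping), and the weighted-sum form of the combination are choices made for this example, and the names euclidean, dtw, combined, and classify_1nn are hypothetical. The weights are hand-picked placeholders, whereas the approach described in the abstract evolves them with a genetic algorithm.

```python
import math
from typing import Callable, List, Sequence, Tuple

Series = Sequence[float]
Distance = Callable[[Series, Series], float]


def euclidean(a: Series, b: Series) -> float:
    """Euclidean distance between two equal-length series."""
    if len(a) != len(b):
        raise ValueError("Euclidean distance needs equal-length series")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(a: Series, b: Series) -> float:
    """Unconstrained dynamic time warping with |x - y| as the local cost."""
    inf = float("inf")
    # prev[j] holds the best cumulative cost of aligning the prefix of `a`
    # processed so far with the first j points of `b`.
    prev = [0.0] + [inf] * len(b)
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            curr[j] = abs(x - y) + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(b)]


def combined(distances: Sequence[Distance], weights: Sequence[float]) -> Distance:
    """Weighted sum of several distance functions (an assumed combination form)."""
    def distance(a: Series, b: Series) -> float:
        return sum(w * d(a, b) for w, d in zip(weights, distances))
    return distance


def classify_1nn(query: Series, train: List[Tuple[Series, str]], distance: Distance) -> str:
    """Return the label of the training series closest to `query`."""
    _, label = min(train, key=lambda pair: distance(query, pair[0]))
    return label


# Toy data, invented for this example only.
train = [([0.0, 0.0, 0.0, 0.0], "flat"), ([0.0, 1.0, 2.0, 3.0], "rising")]
query = [0.0, 1.0, 2.0, 2.0]
measure = combined([euclidean, dtw], [0.5, 0.5])  # placeholder weights
predicted = classify_1nn(query, train, measure)
```

Because the classifier only sees the distance function, swapping in a different measure or a different set of weights changes the decision rule without touching the classification code, which is what makes the choice of similarity measure the central design decision in this setting.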
Several similarity measures exist for time series data, but each of them has its own disadvantages. Some well known similarity measures for time series data are Euclidean distance, Dynamic Time Warping distance (DTW), and Longest Common Subsequence (LCSS). We introduce a similarity based time series classification algorithm that uses the concept of genetic algorithms. The one nearest neighbor (1NN) classifier has often been found to perform better than any other method for time series classification [7]. Due to the effectiveness and simplicity of the 1NN classifier, we focus on combining different similarity measures into one and use the resultant similarity measure with the 1NN classifier.

The paper is organized as follows. We present a brief survey of the related work in Section II. We formally define our problem in Section III. In Section IV, we describe the proposed genetic approach for time series classification. Section V presents the experimental evaluation. Results are shown in Section VI. Finally, we conclude in Section VII.

II. RELATED WORK AND MOTIVATION

We begin this section with a brief description of the distance based classification method. The distance based method requires a similarity measure or a distance function, which is used with some existing classification algorithm. In the current literature, there are over a dozen distance measures for finding the similarity of time series data. Although many algorithms have been proposed providing a new similarity measure as a subroutine to the 1NN classifier, it has been shown

978-1-4244-7835-4/11/$26.00 ©2011 IEEE