Combination of Similarity Measures for Time Series Classification using Genetic Algorithms

Deepti Dohare and V. Susheela Devi
Department of Computer Science and Automation
Indian Institute of Science, India
{deeptidohare, susheela}@csa.iisc.ernet.in

Abstract—Time series classification deals with the problem of classifying data that is multivariate in nature, meaning that one or more of the attributes is in the form of a sequence. The notion of similarity or distance used for time series data is significant: it affects the accuracy as well as the time and space complexity of the classification algorithm. Numerous similarity measures exist for time series data, but each of them has its own disadvantages. Instead of relying on a single similarity measure, our aim is to find a near-optimal solution to the classification problem by combining different similarity measures. In this work, we use genetic algorithms to combine the similarity measures so as to get the best performance. The weight given to each similarity measure evolves over a number of generations so as to arrive at the best combination. We test our approach on a number of benchmark time series datasets and present promising results.

I. INTRODUCTION

Time series data are ubiquitous, since much of the data encountered in practice is in the form of time series, for example, stock prices, annual rainfall, and blood pressure. In fact, other forms of data can also be meaningfully converted to time series, including text, DNA, video, audio, and images [1]. There has also been strong interest in applying data mining techniques to time series data. The classification of time series data is an interesting problem in the field of data mining. The need to classify time series data arises in a broad range of real-world applications in medicine, science, finance, entertainment, and industry.
In cardiology, ECG signals (an example of time series data) are classified in order to determine whether the data comes from a healthy person or from a patient suffering from heart disease [2]. In anomaly detection, users' system access activities on Unix systems are monitored to detect any kind of abnormal behavior [3]. In information retrieval, documents are classified into topic categories, a task that has been shown to be similar to time series classification [4]. Another example is the classification of signals as coming either from nuclear explosions or from earthquakes, in order to monitor a nuclear test ban treaty [5].

Generally, a time series $t = t_1, \ldots, t_r$ is an ordered set of $r$ data points. The data points $t_1, \ldots, t_r$ are typically measured at successive points in time spaced at uniform intervals. A time series may also carry a class label. The problem of time series classification is to learn a classifier $C$, which is a function that maps a time series $t$ to a class label $l$, that is, $C(t) = l$ where $l \in L$, the set of class labels.

Time series classification methods can be divided into three broad categories. The first is the distance based classification method, which requires a measure to compute the distance or similarity between pairs of time sequences [6]–[8]. The second is the feature based classification method, which transforms each time series into a feature vector and then applies a conventional classification method [9], [10]. The third is the model based classification method, where a model such as a Hidden Markov Model (HMM) or another statistical model is used to classify time series data [11], [12]. In this paper, we consider the distance based classification method, where the choice of the similarity measure affects the accuracy as well as the time and space complexity of the classification algorithm [6].
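As an illustration of the distance based formulation, the following Python sketch shows a one nearest neighbor classifier $C$ that accepts any distance function, together with two common distances for time series (Euclidean and dynamic time warping). The function names, the use of absolute difference as the local DTW cost, and the weighted sum in `combine` are assumptions made for this example; they are not taken from the method described in this paper, and the weights here are fixed by hand rather than evolved by a genetic algorithm.

```python
import math
from typing import Callable, List, Sequence, Tuple

Series = Sequence[float]
Distance = Callable[[Series, Series], float]


def euclidean(a: Series, b: Series) -> float:
    """Euclidean distance; defined only for series of equal length."""
    if len(a) != len(b):
        raise ValueError("Euclidean distance needs equal-length series")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(a: Series, b: Series) -> float:
    """Dynamic time warping distance with |x - y| as the local cost
    (an assumed choice) and no warping-window constraint."""
    m = len(b)
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0  # aligning two empty prefixes costs nothing
    for i in range(1, len(a) + 1):
        curr = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: insertion, deletion, match
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]


def combine(measures: Sequence[Distance], weights: Sequence[float]) -> Distance:
    """Hypothetical combination: a weighted sum of several distances."""
    def combined(a: Series, b: Series) -> float:
        return sum(w * d(a, b) for d, w in zip(measures, weights))
    return combined


def classify_1nn(query: Series,
                 train: List[Tuple[Series, str]],
                 distance: Distance) -> str:
    """C(t) = l: return the label of the training series closest to the query."""
    _, label = min(train, key=lambda pair: distance(query, pair[0]))
    return label


# Toy training set: (series, class label) pairs.
train = [([0.0, 0.0, 0.0, 0.0], "flat"), ([0.0, 1.0, 2.0, 3.0], "rising")]
query = [0.0, 1.0, 2.0, 2.5]
label_euclidean = classify_1nn(query, train, euclidean)
label_combined = classify_1nn(query, train, combine([euclidean, dtw], [0.5, 0.5]))
```

Because the distance is passed in as a parameter, the same classifier can be reused unchanged with a single measure or with any combination of measures.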
Several similarity measures exist for time series data, but each of them has its own disadvantages. Some well known similarity measures for time series data are Euclidean distance, Dynamic Time Warping (DTW) distance, and Longest Common Subsequence (LCSS). We introduce a similarity based time series classification algorithm that uses the concept of genetic algorithms. The one nearest neighbor (1NN) classifier has often been found to perform better than any other method for time series classification [7]. Owing to the effectiveness and simplicity of the 1NN classifier, we focus on combining different similarity measures into one and using the resulting similarity measure with the 1NN classifier.

The paper is organized as follows. We present a brief survey of related work in Section II. We formally define our problem in Section III. In Section IV, we describe the proposed genetic approach for time series classification. Section V presents the experimental evaluation. Results are shown in Section VI. Finally, we conclude in Section VII.

II. RELATED WORK AND MOTIVATION

We begin this section with a brief description of the distance based classification method. The distance based method requires a similarity measure or a distance function, which is used with an existing classification algorithm. In the current literature, there are over a dozen distance measures for finding the similarity of time series data. Although many algorithms have been proposed providing a new similarity measure as a subroutine to the 1NN classifier, it has been shown

978-1-4244-7835-4/11/$26.00 ©2011 IEEE