Unsupervised learning algorithm for time series using bivariate AR(1) model

K. Vedavathi a,*, K. Srinivasa Rao b,1, K. Nirupama Devi b,2

a Department of Computer Science, GITAM University, Visakhapatnam 530 045, India
b Department of Statistics, Andhra University, Visakhapatnam 530 003, India

Article info

Keywords: Time series; Clustering; EM algorithm; Autoregressive process

Abstract

Currently, there is increased interest in time series clustering research, particularly for finding useful similar time series in applied areas such as speech recognition, environmental research, finance and medical imaging. Clustering and classification of time series have the potential to analyze large volumes of data. Most traditional time series clustering and classification algorithms deal only with univariate time series data. In this paper, we develop an unsupervised learning algorithm for bivariate time series. The initial clusters are found using the K-means algorithm and the model parameters are estimated using the EM algorithm. The learning algorithm is developed by utilizing component maximum likelihood and the Bayesian Information Criterion (BIC). The performance of the developed algorithm is evaluated using real time data collected from a pollution centre. The proposed algorithm is compared with an existing data mining algorithm that uses a univariate autoregressive model of order 1 (AR(1)). It is observed that the proposed algorithm outperforms the existing algorithms.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

Learning algorithms play a dominant role in analyzing many practical situations, such as those arising in business, scientific research, policy making, and pollution control and monitoring. Usually, the observations collected from these situations for analysis form time series. Time series clustering and classification provide useful information for forecasting and decision making.
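To make the model family named in the title concrete, the following minimal sketch simulates a bivariate AR(1) series and recovers its coefficient matrix by least squares. It is an illustration only, not the algorithm developed in this paper: it assumes a zero-mean process x_t = Phi x_{t-1} + e_t with independent standard normal noise, and the function names, the matrix phi_true and the series length are hypothetical choices made for the example.

```python
import random


def simulate_bivariate_ar1(phi, n, seed=0):
    """Simulate x_t = phi x_{t-1} + e_t with independent N(0, 1) noise (assumed)."""
    rng = random.Random(seed)
    series = [(0.0, 0.0)]
    for _ in range(n - 1):
        a, b = series[-1]
        series.append((
            phi[0][0] * a + phi[0][1] * b + rng.gauss(0.0, 1.0),
            phi[1][0] * a + phi[1][1] * b + rng.gauss(0.0, 1.0),
        ))
    return series


def fit_bivariate_ar1(series):
    """Least-squares estimate of the 2x2 coefficient matrix phi."""
    # Accumulate S = sum x_{t-1} x_{t-1}^T and C = sum x_t x_{t-1}^T.
    s = [[0.0, 0.0], [0.0, 0.0]]
    c = [[0.0, 0.0], [0.0, 0.0]]
    for prev, cur in zip(series, series[1:]):
        for i in range(2):
            for j in range(2):
                s[i][j] += prev[i] * prev[j]
                c[i][j] += cur[i] * prev[j]
    # Invert the 2x2 matrix S in closed form.
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    s_inv = [[s[1][1] / det, -s[0][1] / det],
             [-s[1][0] / det, s[0][0] / det]]
    # The least-squares solution is phi_hat = C S^{-1}.
    return [[sum(c[i][k] * s_inv[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


# Hypothetical stationary coefficient matrix (eigenvalues 0.7 and 0.4).
phi_true = [[0.6, 0.2], [0.1, 0.5]]
series = simulate_bivariate_ar1(phi_true, 5000, seed=0)
phi_hat = fit_bivariate_ar1(series)
```

With a long series, phi_hat should lie close to phi_true. The off-diagonal entries of the matrix carry the dependence between the two component series, which a pair of separate univariate AR(1) fits would ignore.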
In seismology, Kakizawa, Shumway, and Taniguchi (1998) applied clustering techniques to establish the similarities or differences between classes of events such as earthquakes and mining explosions. Several authors have presented methodologies for clustering time series data. Kavitha and Punithavalli (2010) reviewed the literature on clustering of time series. Maharaj (1999, 2000) extended an analogous testing procedure to the case of correlated univariate and multivariate stationary time series. The discrimination problem was also investigated as a model selection problem. A hypothesis test for comparing two stationary time series, and a classification procedure that uses this test to cluster stationary time series, were discussed. The management of large data has created interest in time series clustering and discrimination (Ananthanarayana, Murty Narasimha, & Subramanian, 2001). Recently, Corduas and Piccolo (2008) studied time series clustering and classification using the autoregressive metric and emphasized the need for time series clustering and classification.

At present, interest is focused on composite procedures which combine different statistical techniques to obtain more reliable classification, such as the algorithm for clustering financial time series proposed by Pattarin, Paterlini, and Minerva (2004), the method based on functional analysis explored by Ingrassia, Cerioli, and Corbellini (2003), and the clustering technique developed by Alonso, Berrendero, Hernandez, and Justel (2006) based on the full probability density of forecasts. An extensive review of the topic was given by Warren Liao (2005). Bagnall and Janacek (2005) developed a procedure of clipping time series that reduces memory requirements and significantly speeds up clustering without decreasing clustering accuracy.
Expert Systems with Applications 41 (2014) 3402–3408
http://dx.doi.org/10.1016/j.eswa.2013.11.030
0957-4174/$ - see front matter © 2013 Elsevier Ltd. All rights reserved.
* Corresponding author. Tel.: +91 891 2564286; fax: +91 891 2790032.
E-mail addresses: vedavathi_k@yahoo.com (K. Vedavathi), ksraoau@yahoo.co.in (K. Srinivasa Rao), knirupamadevi@gmail.com (K. Nirupama Devi).
1 Tel.: +91 891 2844650; fax: +91 891 2755547.
2 Tel.: +91 891 2844655; fax: +91 891 2755547.

Shiva Nagendra and Khare (2009) used univariate time series models to forecast the hourly average carbon monoxide (CO) concentration in the air during the critical (winter) period at two air quality control regions in Delhi. Yavuz and Ozyilmaz (2009) classified R5X4-type HIV viruses using an autoregressive model through artificial neural networks (ANNs). Ibrahim, Zailan, Ismail, and Lola (2009) applied a Box–Jenkins ARIMA approach to model the time series of monthly maximum hourly carbon monoxide (CO) and nitrogen dioxide (NO2) concentrations in the east coast state of Peninsular Malaysia. Pamminger and Frühwirth-Schnatter (2010) discussed two approaches for model-based clustering of categorical time series based on time