Data clustering based on correlation analysis applied to highly variable domains

Stefania Tosi ⇑, Sara Casolari, Michele Colajanni
Department of Information Engineering, University of Modena and Reggio Emilia, Italy

Article info
Article history: Available online 14 July 2013
Keywords: Traffic clustering; High variability; Correlation index; Network management

Abstract
Clustering of traffic data based on correlation analysis is an important element of several network management objectives, including traffic shaping and quality of service control. Existing correlation-based clustering algorithms give poor results when applied to the highly variable time series that characterize most network traffic data. This paper proposes a new similarity measure for computing clusters of highly variable data on the basis of their correlation. Experimental evaluations on several synthetic and real datasets show the accuracy and robustness of the proposed solution, which improves on existing clustering methods based on statistical correlations.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

Clustering is a widely adopted approach for increasing the level of knowledge about raw data. The goals of clustering applied to computer and network datasets vary, ranging from Web site characterization [1] and classification of user navigation patterns [4] to network traffic classification and management [2]. For example, many network management goals, such as flow prioritization, traffic shaping and policing, and diagnostic monitoring, as well as many network engineering problems, such as workload characterization and modeling, capacity planning, and route provisioning, may benefit from traffic clustering [2].

In this paper, we are interested in correlation-based clustering algorithms applied to highly variable time series.
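As a rough illustration of what grouping time series by correlation means, the following sketch clusters three made-up series using the Pearson coefficient and compares the outcome with their Euclidean distances. The series values, the helper names `pearson`, `euclidean`, and `cluster_by_correlation`, and the 0.9 threshold are all assumptions chosen for this example. The grouping rule is a toy one and is not the similarity measure proposed in this paper.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def euclidean(x, y):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cluster_by_correlation(series, threshold=0.9):
    """Toy rule: merge series whose pairwise correlation reaches the threshold."""
    names = list(series)
    label = {name: i for i, name in enumerate(names)}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if pearson(series[first], series[second]) >= threshold:
                old, new = label[second], label[first]
                for name in names:
                    if label[name] == old:
                        label[name] = new
    groups = {}
    for name in names:
        groups.setdefault(label[name], []).append(name)
    return sorted(groups.values())


# Hypothetical traffic samples (not taken from the paper's datasets).
series = {
    "link_a": [10, 12, 9, 15, 11, 14],
    "link_b": [1000, 1200, 900, 1500, 1100, 1400],  # same shape, 100x larger
    "link_c": [12, 9, 14, 10, 15, 11],              # similar scale, different shape
}

clusters = cluster_by_correlation(series)
dist_ab = euclidean(series["link_a"], series["link_b"])
dist_ac = euclidean(series["link_a"], series["link_c"])
```

In this example `link_a` and `link_b` fall into the same cluster although their Euclidean distance is large, whereas `link_c`, which is numerically much closer to `link_a`, is kept apart because the two series move in opposite directions.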
Correlation-based algorithms (e.g., those using the Pearson product moment [7] or the Spearman and Kendall ranks [8,9]) consider time series similar if they exhibit some degree of statistical inter-dependency. They differ from other popular approaches that use a geometrical distance (e.g., Euclidean distance [6], cosine distance [5]) as their similarity measure. We focus on correlation similarity measures because distance functions are not always adequate for capturing dependencies among the data. In fact, strong dependencies may exist between time series even if their data samples are far apart from each other as measured by distance functions [3]. In the next section, we support this statement through a network-related example.

The choice and the performance of the similarity measure affect the quality of any clustering algorithm. The more accurate and robust the measure is in finding similarity, the better the quality of the clustering model. Existing correlation indexes are accurate and robust in disclosing similarity except when time series exhibit high variability. This is the case for most traffic data, which are highly variable in terms of number of connections, request inter-arrivals, and flow sizes (e.g., [13,16,14]). In these scenarios, popular correlation indexes, such as the Pearson coefficient [7], the Spearman rank [8], the Kendall rank [9], and the Local Correlation index [10], show poor results because they are unable to capture correlations even when they exist.

We propose a new similarity measure that is able to disclose correlation even when time series are characterized by high variability. The accuracy and robustness of the proposed correlation index are achieved through an

1389-1286/$ - see front matter © 2013 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.comnet.2013.07.004
⇑ Corresponding author. Address: Via Vignolese 905/B, 41125 Modena, Italy. Tel.: +39 0592056273; fax: +39 0592056129.
E-mail addresses: stefania.tosi@unimore.it (S. Tosi), sara.casolari@unimore.it (S. Casolari), michele.colajanni@unimore.it (M. Colajanni).
Computer Networks 57 (2013) 3025–3038