Finding Clusters in Subspaces of Very Large, Multi-dimensional Datasets

Robson L. F. Cordeiro (1), Agma J. M. Traina (2), Christos Faloutsos (3), Caetano Traina Jr. (4)

Computer Science Department - ICMC, University of São Paulo
400 Trabalhador Saocarlense Ave, São Carlos SP 668, Brazil
(1) robson@icmc.usp.br  (2) agma@icmc.usp.br  (4) caetano@icmc.usp.br

School of Computer Science, Carnegie Mellon University
5000 Forbes Ave, Pittsburgh PA 15213, USA
(3) christos@cs.cmu.edu

Abstract—We propose the Multi-resolution Correlation Cluster detection (MrCC), a novel, scalable method to detect correlation clusters, able to analyze dimensional data in the range of around 5 to 30 axes. Existing methods typically exhibit super-linear behavior in terms of space or execution time. MrCC employs a novel data structure based on multi-resolution and improves over previous approaches in that: (a) it finds clusters that stand out in the data in a statistical sense; (b) it is linear in running time and memory usage with respect to the number of data points and to the dimensionality of the subspaces where clusters exist; (c) it is linear in memory usage and quasi-linear in running time with respect to the space dimensionality; and (d) it is accurate, deterministic, robust to noise, does not require the number of clusters as an input parameter, performs no distance calculations, and is able to detect clusters in subspaces generated by the original axes or by linear combinations of them, including space rotation. We performed experiments on synthetic data ranging from 5 to 30 axes and from 12k to 250k points, and MrCC outperformed five recent, related works in running time, being on average 10 times faster than the competitors that also achieved high accuracy on every tested dataset. Regarding real data, MrCC found clusters at least 9 times faster than the competitors, increasing their accuracy by up to 34 percent.
I. INTRODUCTION

Traditional clustering methods suffer from the curse of dimensionality and often fail to produce acceptable results when data dimensionality rises above ten or so [1]. The dimensionality curse refers to the fact that, as dimensionality increases, the space tends to become very sparse and the distances between any pair of points tend to become similar, regardless of the distance function employed [2]. That is why traditional clustering methods are likely to fail when facing this kind of data. However, it has been shown that multi-dimensional data tend to cluster in subspaces of the original space (i.e., sets of orthogonal vectors formed from the original axes or combinations of subsets of them) [3], [4], [5], [1], [6], [7].

This material is based upon work supported by FAPESP (São Paulo State Research Foundation), CAPES (Brazilian Coordination for Improvement of Higher Level Personnel), CNPq (Brazilian National Council for Supporting Research) and the National Science Foundation under Grant No. DBI-0640543. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, or other funding parties.

Dimensionality reduction aims at eliminating global correlations. To be detected, these correlations must exist in every region of the dataset with respect to a set of orthogonal vectors formed by either the original axes (feature selection) or linear combinations of them (feature extraction). However, besides global correlations, local correlations are likely to exist in subsets of the original axes. In fact, one cluster may present correlations that differ from those presented by another cluster. Traditional dimensionality reduction does not solve the problem of identifying local clusters and correlations in subsets of the dimensions. In this paper we use the following notation.
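The distance-concentration effect behind the dimensionality curse is easy to reproduce. The sketch below (illustrative only; the function name and parameters are our own, not the paper's) measures the "relative contrast" of pairwise Euclidean distances among uniformly random points and shows that it shrinks as the dimensionality grows, which is exactly why distance-based clustering degrades:

```python
import numpy as np

def relative_contrast(n_points, d, rng):
    """(max - min) / min over all pairwise Euclidean distances.

    A large value means near and far neighbors are well separated;
    a value near zero means all distances look alike.
    """
    pts = rng.random((n_points, d))              # uniform in [0, 1)^d
    diffs = pts[:, None, :] - pts[None, :, :]    # all pairwise differences
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    dist = dist[np.triu_indices(n_points, k=1)]  # distinct pairs only
    return (dist.max() - dist.min()) / dist.min()

rng = np.random.default_rng(0)
low = relative_contrast(200, 2, rng)     # low-dimensional space
high = relative_contrast(200, 100, rng)  # high-dimensional space
print(f"relative contrast: d=2 -> {low:.2f}, d=100 -> {high:.2f}")
```

Running this shows the contrast dropping by orders of magnitude from d=2 to d=100, consistent with the behavior reported in [2].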
Definition 1: A multi-dimensional dataset ^dS is a set of η points in a d-dimensional space over the set of axes ^dE = {e_1, e_2, ..., e_d}, where d is the dimensionality of the dataset and e_j, 1 ≤ j ≤ d, is an axis (also called a dimension or an attribute) of the space where the dataset is embedded. A point s_i ∈ ^dS is a set of values s_i = {s_i1, s_i2, ..., s_id}. Without loss of generality, we assume that each value s_ij, 1 ≤ i ≤ η, 1 ≤ j ≤ d, is a real value in [0, 1), so the whole dataset is embedded in the d-dimensional hyper-cube [0, 1)^d.

Figure 1 shows examples of local correlations in two 3-dimensional datasets over the axes ^3E = {x, y, z} [5]. Figure 1a shows a 3-dimensional dataset ^3S projected onto axes x and y, while Figure 1b shows the same dataset projected onto axes x and z. We can see that there exist two clusters in this dataset, C_1 and C_2. Cluster C_1 exists in the subspace formed by the two axes x and z, so we denote it as ^2γC_1, while cluster C_2 exists in the subspace formed by the two axes x and y, so we denote it as ^2γC_2. The symbol γ marks them as correlation clusters. It is not worthwhile to apply a global dimensionality reduction technique to this dataset, as every axis is important to characterize at least one cluster. Also, traditional clustering methods are likely to fail, since each cluster is spread over one axis. Figures 1a and 1b correspond to a simple situation, as both clusters are aligned along original axes; thus, the clusters exist in subspaces formed by the original axes. Nevertheless, real data may also have clusters aligned along arbitrarily oriented axes. As an example, Figures 1c and 1d respectively represent x-

978-1-4244-5446-4/10/$26.00 © 2010 IEEE — ICDE Conference 2010
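A minimal synthetic sketch can make the situation of Figures 1a and 1b concrete. The data below are hypothetical (our own cluster centers and spreads, not the paper's): two clusters in the unit cube [0, 1)^3, one tight in the subspace {x, z} but spread over y, the other tight in {x, y} but spread over z. The per-axis standard deviation then reveals which axes form each cluster's subspace:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# ^2γC_1: concentrated around (x, z) = (0.2, 0.7), uniform over y.
c1 = np.column_stack([
    0.2 + 0.02 * rng.standard_normal(n),  # x: tight
    rng.random(n),                        # y: spread over the whole axis
    0.7 + 0.02 * rng.standard_normal(n),  # z: tight
])

# ^2γC_2: concentrated around (x, y) = (0.8, 0.3), uniform over z.
c2 = np.column_stack([
    0.8 + 0.02 * rng.standard_normal(n),  # x: tight
    0.3 + 0.02 * rng.standard_normal(n),  # y: tight
    rng.random(n),                        # z: spread over the whole axis
])

# A small per-axis standard deviation flags an axis belonging to the
# cluster's subspace; a large one flags an axis where it is spread out.
for name, c in (("C1", c1), ("C2", c2)):
    sx, sy, sz = c.std(axis=0)
    print(f"{name}: std(x)={sx:.3f} std(y)={sy:.3f} std(z)={sz:.3f}")
```

Note that a full-space distance-based method sees both clusters stretched across one entire axis, while per-subspace analysis (as MrCC performs) separates them cleanly.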