Improved visual clustering of large multi-dimensional data sets

Eduardo Tejada
Institute for Visualization and Interactive Systems, University of Stuttgart
eduardo.tejada@vis.uni-stuttgart.de

Rosane Minghim
Institute of Mathematics and Computer Science, University of São Paulo
rminghim@icmc.usp.br

Abstract

Lowering the computational cost of data analysis and visualization techniques is an essential step towards including the user in the visualization. In this paper we present an improved algorithm for visual clustering of large multi-dimensional data sets. The original algorithm deals efficiently with multi-dimensionality by using various projections of the data to perform multi-space clustering, pruning outliers through direct user interaction. The algorithm presented here, named HC-Enhanced (for Human-Computer enhanced), adds a level of scalability to the approach without reducing clustering quality. Additionally, an algorithm to improve clusters is added to the approach. A number of test cases are presented with good results.

1 Introduction

Clustering large multi-dimensional data sets presents two major problems besides generating a good clustering: scalability and the capacity to deal with multi-dimensionality. Scalability usually means linear complexity (O(n)) or better (O(n log n), O(log n), etc.). For highly interactive systems, such as visual clustering techniques [1, 11, 16], scalability is critical. Algorithms with linear complexity but a high constant factor (O(kn), k constant) are not suitable for such systems: interactivity is a key issue that must be included in visual mining techniques [20], so the response time must be as low as possible. Treating multi-dimensionality is also a very difficult task. Research tackling this problem is based on three approaches: subspace clustering, co-clustering [4], and feature selection [19].
These approaches focus on solving the problem known as the “dimensionality curse”, that is, the incapacity to generate significant structures (patterns or models) from high-dimensional data. For clustering algorithms this means, in most cases, more than 15 dimensions [4, 5].

Subspace clustering refers to approaches that apply dimensionality reduction before clustering the data. Different approaches to dimensionality reduction have been widely used, such as Principal Components Analysis (PCA) [12], Fastmap [7], Singular Value Decomposition (SVD) [17], and fractal-based techniques [13, 15]. We have also developed a novel technique named Nearest-Neighbor Projection (NNP) for multi-dimensional data exploration in two-dimensional spaces [18]. For all these approaches, there is no guarantee of dimensionality reduction without losing a considerable amount of information, and they are likely to find different clusters in different projections of the same data, as shown in the literature [1, 2, 3]. Thus, clustering in projected subspaces could lead to a poor result of the clustering process. Additionally, clustering quality evaluation is dependent on the application. These are the reasons why generating a good clustering for a specific application cannot be achieved without direct user interference. Visual clustering techniques exploit this fact by replacing usually costly heuristics with user decisions.

Besides these facts, it is also very desirable for clustering approaches to provide mechanisms for handling outliers¹ and to define a metric for determining whether a cluster is consistent with the user responses.

Table 1 summarizes the features of the most representative approaches of the various families of clustering algorithms². Aggarwal’s IPCLUS algorithm [1] accomplishes almost all of the requirements mentioned above.
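The subspace-clustering pipeline discussed above — reduce dimensionality first, then cluster in the projected space — can be sketched as follows. This is only an illustration of the general idea, not part of HC-Enhanced or IPCLUS; the synthetic two-group data, the PCA-via-SVD projection, and the minimal k-means loop are all assumptions made for the example:

```python
import numpy as np

def pca_project(X, k=2):
    """Project X onto its top-k principal components (PCA via SVD)."""
    Xc = X - X.mean(axis=0)                      # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                         # coordinates in the k-D subspace

def kmeans(X, k, iters=50):
    """Minimal Lloyd's k-means with deterministic farthest-point seeding."""
    centers = [X[0]]
    for _ in range(k - 1):                       # seed each new center far from the rest
        dists = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(iters):
        labels = ((X[:, None] - centers) ** 2).sum(axis=-1).argmin(axis=1)
        for j in range(k):
            if (labels == j).any():              # recompute each non-empty center
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Two well-separated Gaussian groups in 10 dimensions (synthetic data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 10)),
               rng.normal(5.0, 1.0, (50, 10))])

# Cluster in the 2-D projected subspace rather than the full 10-D space.
labels = kmeans(pca_project(X, k=2), k=2)
```

Here the projection happens to preserve the two groups, but as noted above, a different projection of the same data may expose different clusters, which is why projecting before clustering offers no quality guarantee on its own.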
However, IPCLUS cannot be applied to very large data sets due to the costly processes used for projecting the data and estimating the density. In this work we have developed mechanisms for reducing the time spent in those processes. Those mechanisms were introduced in different steps of the algorithm.

Results demonstrate a processing-time reduction of 50% to 92%, as well as clustering improvement in some cases and the same clustering quality in all others. This improve-

¹ Instances that cannot be included in any cluster.
² See the work by Berkhin for details on most of the clustering algorithms found in the literature [4].