MOSAIC: A Proximity Graph Approach for Agglomerative Clustering 1

Jiyeon Choo 1, Rachsuda Jiamthapthaksin 1, Chun-sheng Chen 1, Oner Ulvi Celepcikay 1, Christian Giusti 2, and Christoph F. Eick 1

1 Computer Science Department, University of Houston, Houston, TX 77204-3010, USA
2 Department of Mathematics and Computer Science, University of Udine, Via delle Scienze, 33100, Udine, Italy
1 {jchoo, rachsuda, lyons19, onerulvi, ceick}@cs.uh.edu, 2 giusti@dimi.uniud.it

Abstract. Representative-based clustering algorithms are popular due to their relatively high speed and their sound theoretical foundation. On the other hand, the clusters they can obtain are limited to convex shapes, and clustering results are highly sensitive to initialization. In this paper, a novel agglomerative clustering algorithm called MOSAIC is proposed, which greedily merges neighboring clusters to maximize a given fitness function. MOSAIC uses Gabriel graphs to determine which clusters are neighbors and approximates non-convex shapes as unions of small clusters that have been computed using a representative-based clustering algorithm. The experimental results show that this technique leads to clusters of higher quality compared to running a representative-based clustering algorithm stand-alone. Given a suitable fitness function, MOSAIC is able to detect clusters of arbitrary shape. In addition, MOSAIC is capable of dealing with high-dimensional data.

Keywords: Post-processing, hybrid clustering, finding clusters of arbitrary shape, agglomerative clustering, using proximity graphs for clustering.

1 Introduction

Representative-based clustering algorithms form clusters by assigning objects to the closest cluster representative. k-means is the most popular representative-based clustering algorithm: it uses cluster centroids as representatives and iteratively updates clusters and centroids until no change in the clustering occurs.
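The iterative scheme just described (assign each object to its nearest centroid, recompute the centroids, repeat until the assignment no longer changes) can be sketched as follows. This is a minimal illustration, not the paper's implementation; in particular, seeding the centroids from the first k points is a simplification chosen only to keep the example deterministic.

```python
def kmeans(points, k, max_iter=100):
    """Minimal Lloyd-style k-means on tuples of coordinates.

    Assignment step: each point goes to the nearest centroid (squared
    Euclidean distance). Update step: each centroid becomes the mean of
    its assigned points. Stops when assignments stabilize.
    """
    # Naive deterministic seeding for illustration; real implementations
    # use random restarts or k-means++ initialization.
    centroids = [tuple(p) for p in points[:k]]
    assign = None
    for _ in range(max_iter):
        # Assignment step.
        new_assign = [
            min(range(k),
                key=lambda j: sum((x - c) ** 2
                                  for x, c in zip(pt, centroids[j])))
            for pt in points
        ]
        if new_assign == assign:  # no change in the clustering: converged
            break
        assign = new_assign
        # Update step: centroid = mean of its members.
        for j in range(k):
            members = [pt for pt, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = tuple(sum(d) / len(members)
                                     for d in zip(*members))
    return assign, centroids
```

On two well-separated groups, the assignment converges after a few iterations, e.g. `kmeans([(0,0),(0,1),(1,0),(10,10),(10,11),(11,10)], 2)` separates the first three points from the last three.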
k-means is a relatively fast clustering algorithm with a complexity of O(ktn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. The clusters it generates are always contiguous. However, when using k-means the number of clusters k has to be known in advance, and k-means is very sensitive to initialization and outliers. Another problem of the k-means clustering algorithm is that it

1 This paper appeared in Proceedings of the 9th International Conference on Data Warehousing and Knowledge Discovery (DaWaK), Regensburg, Germany, September 2007.
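As background for the Gabriel graphs mentioned in the abstract: two points p and q are Gabriel neighbors if and only if no third point lies strictly inside the disc whose diameter is the segment pq. A brute-force sketch of this edge test follows (the function name and the O(n^3) all-pairs formulation are illustrative, not the paper's construction, which would operate on cluster representatives):

```python
from itertools import combinations

def gabriel_edges(points):
    """Return index pairs (i, j) that are Gabriel neighbors.

    Edge test: points[i] and points[j] are connected iff no other point
    falls strictly inside the disc with diameter points[i]-points[j],
    i.e. closer to the midpoint than half the distance between them.
    """
    def d2(a, b):  # squared Euclidean distance
        return sum((x - y) ** 2 for x, y in zip(a, b))

    edges = []
    for i, j in combinations(range(len(points)), 2):
        p, q = points[i], points[j]
        mid = tuple((x + y) / 2 for x, y in zip(p, q))
        radius2 = d2(p, q) / 4  # squared radius of the diameter disc
        if all(d2(points[r], mid) >= radius2
               for r in range(len(points)) if r not in (i, j)):
            edges.append((i, j))
    return edges
```

For example, for the points (0,0), (2,0), (1,0.1), the middle point lies inside the disc spanned by the outer two, so the edge between indices 0 and 1 is absent while both edges to index 2 are present.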