A Parallel Algorithm for Incremental Compact Clustering

Reynaldo Gil-García 1, José M. Badía-Contelles 2, and Aurora Pons-Porrata 1

1 Universidad de Oriente, Santiago de Cuba, Cuba
{gil,aurora}@app.uo.edu.cu
2 Universitat Jaume I, Castellón, Spain
badia@icc.uji.es

Abstract. In this paper we propose a new parallel clustering algorithm based on the incremental construction of the compact sets of a collection of objects. The parallel algorithm is portable to different parallel architectures and uses the MPI library for message passing. We also include experimental results on a cluster of personal computers, using randomly generated synthetic data and collections of documents. Our algorithm balances the load among the processors and tries to minimize communications. The experimental results show that the parallel algorithm clearly outperforms its sequential version on large data sets.

1 Introduction

Clustering algorithms are widely used for document classification, clustering of genes and proteins with similar functions, event detection and tracking on streams of news, image segmentation, and so on. Given a collection of n objects characterized by m features, clustering algorithms construct partitions or covers of this collection. The similarity among objects in the same cluster should be maximum, whereas the similarity among objects in different clusters should be minimum. Clustering algorithms have three main elements: the representation space, the similarity measure, and the clustering criterion.

In many applications, the collection of objects is dynamic, with new items being added on a regular basis. An example of these applications is event detection and tracking on streams of news. Classic algorithms need to know all the objects in order to perform the clustering, so each time the set of objects is modified, the whole collection must be clustered again.
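To make the clustering criterion concrete, compact sets are commonly described as the connected components of the graph that links each object to its most similar objects. The following Python sketch illustrates this batch construction under that assumption; the function name `compact_sets` and the toy similarity matrix are illustrative, not taken from the paper.

```python
import numpy as np

def compact_sets(sim):
    """Sketch of batch compact-set clustering.

    sim: n x n symmetric similarity matrix (diagonal is ignored).
    Returns a list of clusters, each a sorted list of object indices.
    Assumption: compact sets are the connected components of the graph
    linking every object to its maximum-similarity neighbours.
    """
    n = sim.shape[0]
    s = sim.astype(float).copy()
    np.fill_diagonal(s, -np.inf)               # exclude self-similarity
    # Each object points to its most similar neighbours (ties included).
    adj = [set(np.flatnonzero(np.isclose(s[i], s[i].max())))
           for i in range(n)]
    # Symmetrize the relation so components are well defined.
    for i in range(n):
        for j in adj[i]:
            adj[j].add(i)
    # Connected components via depth-first search.
    seen, clusters = set(), []
    for i in range(n):
        if i in seen:
            continue
        stack, comp = [i], []
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            comp.append(u)
            stack.extend(adj[u] - seen)
        clusters.append(sorted(comp))
    return clusters

# Toy example: objects 0-1 and 2-3 are mutually most similar,
# so two compact sets are produced.
sim = np.array([[1.0, 0.9, 0.1, 0.2],
                [0.9, 1.0, 0.2, 0.1],
                [0.1, 0.2, 1.0, 0.8],
                [0.2, 0.1, 0.8, 1.0]])
print(compact_sets(sim))  # [[0, 1], [2, 3]]
```

Note that this is the batch formulation: any change to the collection forces the whole graph to be rebuilt, which is precisely the limitation that motivates the incremental algorithm discussed next.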
Thus, we need algorithms able to update the clusters each time a new object is added, without rebuilding the whole set of clusters. This kind of algorithm is called incremental. Many recent applications involve huge data sets that cannot be clustered in a reasonable time using one processor. Moreover, in many cases the data cannot

This work was partially supported by the Spanish CICYT projects TIC 2002-04400-C03-01 and TIC 2000-1683-C03-03.

H. Kosch, L. Böszörményi, H. Hellwagner (Eds.): Euro-Par 2003, LNCS 2790, pp. 310–317, 2003.
© Springer-Verlag Berlin Heidelberg 2003