A Parallel Attractor-tree Based Clustering Method

Baoying Wang
Waynesburg University, Waynesburg, PA 15370, USA
ewang@waynesburg.edu

Aijuan Dong
Hood College, Frederick, Maryland 21701, USA
dong@hood.edu

Abstract

Data clustering has proven to be a successful data mining technique. There are two kinds of clustering approaches: partitioning clustering and hierarchical clustering. Hierarchical clustering is more flexible than partitioning clustering, but it is very expensive for large data sets. In this paper, we propose a parallel algorithm for attractor-tree based hierarchical clustering, implemented using MPI (Message Passing Interface). We ran the parallel program on different numbers of machines and compared it with the sequential approach. Experiments show that our parallel approach speeds up the sequential clustering while producing comparably good clustering results.

Keywords: data mining, clustering, parallel computing, attractor trees

1 Introduction

Clustering techniques partition a data set into groups such that similar items fall into the same group [2]. Data clustering is a common data mining technique; however, some concerns and challenges remain. Most partitioning clustering methods depend on input parameters, such as the number of clusters [4]. Hierarchical clustering is more flexible than partitioning clustering, but it is computationally expensive for large data sets [6]. Scalable parallel computers can be used to speed up hierarchical clustering, and there has recently been increasing interest in parallel implementations of data clustering algorithms [1, 5, 8, 10, 11, 14]. However, most existing parallel approaches have been developed for traditional agglomerative clustering.
In this paper, we propose a parallel algorithm that implements our previous work, Clustering using Attractor trees and Merging Process (CAMP) [13], on MIMD (Multiple Instruction stream, Multiple Data stream) parallel machines using MPI. CAMP clusters a dataset into a set of preliminary attractor trees and then merges the attractor trees until the whole dataset becomes one tree. CAMP is a hierarchical clustering method, but it is closer to hybrid clustering [7, 9]: a combination of partitioning clustering and hierarchical clustering. It is faster than traditional hierarchical clustering but still does not scale well as the data size increases. Experiments demonstrate that our parallel approach speeds up the sequential hierarchical clustering tremendously with comparably good clustering results.

This paper is organized as follows. Section 2 briefly reviews the sequential CAMP clustering. Section 3 discusses the parallel approach for CAMP. Section 4 presents the experimental results. Finally, Section 5 concludes the paper.

2 Review Of The Sequential Algorithm

In this section, we briefly review the sequential Clustering using Attractor trees and Merging Process (CAMP). CAMP consists of two processes: (1) clustering using local attractor trees, and (2) a cluster merging process. The final clustering result consists of an attractor tree and a set of attractors corresponding to each level of the tree.

2.1 Density Function

Given a data point x in a data space X, the density function of x is defined as the sum of the influence functions of all data points in X on x. In general, the influence of a data point on x is inversely proportional to its distance from x. If we divide the neighborhood of x into equal interval neighborhood rings (EINrings), as in Figure 1, then points within inner rings have more influence on x than those in outer rings.

Figure 1. Diagram of EINrings.

Let y be a data point within the k-th EINring of x.
The neighborhood EINring-based influence function of y on x is defined as:

    f(y, x) = λ^k,  0 < λ < 1,

where k is the index of the EINring of x that contains y and λ is a decay constant, so that points in inner rings (smaller k) exert more influence on x than points in outer rings.
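The EINring-based density computation can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes Euclidean distance, rings of a fixed width `ring_width`, and an influence that decays geometrically with the ring index (λ^k for an assumed decay factor 0 < λ < 1); the function names are hypothetical.

```python
import math

def einring_index(x, y, ring_width):
    """Index of the EINring of x that contains y (ring 0 is innermost).

    x and y are equal-length coordinate tuples; rings are concentric
    shells of equal width around x.
    """
    return int(math.dist(x, y) // ring_width)

def einring_influence(x, y, ring_width, lam=0.5):
    """Assumed EINring-based influence of y on x: lam ** k."""
    return lam ** einring_index(x, y, ring_width)

def density(x, data, ring_width, lam=0.5):
    """Density of x: sum of the influences of all points in the data space on x."""
    return sum(einring_influence(x, y, ring_width, lam) for y in data)

# Example: points close to x fall in inner rings and contribute more.
data = [(0.0, 0.0), (0.2, 0.0), (1.5, 0.0), (3.2, 0.0)]
x = (0.0, 0.0)
print(density(x, data, ring_width=1.0, lam=0.5))  # 1 + 1 + 0.5 + 0.125 = 2.625
```

In the sequential CAMP method, each point would then be attracted toward its densest neighbor, which is how the local attractor trees are grown.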