Large-Scale Clustering using MPI-based Canopy

Jacek Burys, Ahsan Javed Awan, Thomas Heinis
Department of Computing, Imperial College London, UK

Abstract—Analyzing massive amounts of data and extracting value from it has become key across different disciplines. Clustering is a common technique to find patterns in the data. Existing clustering algorithms require parameters to be set a priori. These parameters are usually determined through trial and error over several iterations or through pre-clustering algorithms, which do not scale well to massive amounts of data. In this paper, we thus take one such pre-clustering algorithm, Canopy, and develop a parallel version based on MPI. As we show, doing so is not straightforward: without optimization, a considerable amount of time is spent waiting for synchronisation, severely limiting scalability. We thus optimize our approach to minimise the time cores spend idle at synchronisation barriers. As our experiments show, our approach scales near-linearly with increasing dataset size.

I. INTRODUCTION

Data analysis plays a crucial role across different scientific fields and industrial applications. It is used to find patterns in data that can then help us gain new insights and advance our understanding. One class of analytics approaches, clustering algorithms, is a critical step in data mining when information has to be extracted by clustering a plethora of data points into distinguishable groups across the whole data set. Their applications span academic fields like bioinformatics [1] and cancer research [2], but they also find use in more commercial applications, such as market research.

While there exists a plethora of tools and algorithms for clustering data today, most of them have considerable complexity: they are very efficient on the small datasets we dealt with until recently, but as the amounts of data grow rapidly, these algorithms do not scale well.
Given the rapidly growing amounts of data in need of analysis, efforts to research parallel and distributed versions of clustering algorithms have increased in recent years. For example, MR-DBSCAN [3] is a modified version of the well-known and broadly used DBSCAN that uses MapReduce [4] to scale clustering in a distributed setting.

Essential to clustering, however, is also pre-clustering. Pre-clustering is needed because the process of clustering is not straightforward: clustering requires a priori knowledge of parameters (the number of clusters for k-means [5] or similar). These parameters are typically determined in an iterative process through trial and error or through a pre-clustering algorithm.

Several pre-clustering algorithms have been developed. One example is Canopy Clustering [6], used before clustering with, for example, k-means. It works by dividing the data into overlapping sets called canopies, which reduces the number of distance measurements because only pairs of data points within the same canopy have to be considered.

Only very limited work, however, has been done on scaling and parallelizing pre-clustering algorithms. In this paper, we thus develop a parallel version of the Canopy Clustering algorithm based on MPI. As we will show, implementing Canopy Clustering in parallel is not straightforward. We thus iteratively develop three variants of it, and in each successive version we reduce the time wasted at synchronisation barriers. We experimentally evaluate all three versions.

The remainder of this paper is structured as follows. We first provide background on Canopy Clustering and the Message Passing Interface (MPI) in Section II. We then discuss how we develop and optimize an implementation of Canopy on MPI in Section III and evaluate its performance experimentally in Section IV. We finally draw conclusions in Section VI.

II.
BACKGROUND

In the following, we first discuss Canopy Clustering and then the Message Passing Interface.

A. Canopy Clustering Algorithm

Canopy Clustering is a comparatively simple pre-processing step intended to be used before the data is clustered using a different algorithm (e.g., k-means). This pre-processing step speeds up clustering by dividing the data points into multiple subsets called canopies. The points within one canopy are close to each other, and only other points within the same canopy have to be considered during clustering. Crucially, one data point can be assigned to multiple canopies, i.e., the canopies are not disjoint sets.

Algorithm 1 Canopy Clustering
Input: Data set x1, x2, ..., xm ∈ R^n; thresholds T1, T2 (with T1 > T2)
Output: Canopies
1: Begin with a set of data points, S.
2: Repeat until S is empty:
3:   Select a random point c from S.
4:   Remove c from S and start a new canopy.
5:   For each point x remaining in S:
6:     Compute the distance d from x to c.
7:     If d < T1, assign x to the new canopy.
8:     If d < T2, remove x from S.

Algorithm 1 illustrates the Canopy Clustering algorithm in pseudocode. The algorithm starts with a set of points that will form the canopies as well as distances T1 and T2. T1 is the loose distance, while T2 is the tight distance. The algorithm selects a random point to form a new canopy and removes it from the set. It then iterates through all remaining
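To make the steps of Algorithm 1 concrete, the following is a minimal serial sketch in Python. It assumes Euclidean distance and represents points as tuples; the function name `canopy_clustering` and the representation of canopies as lists of points are our own illustrative choices, not the paper's MPI implementation.

```python
import math
import random

def canopy_clustering(points, t1, t2):
    """Serial Canopy clustering (Algorithm 1).

    points: iterable of equal-length tuples (vectors in R^n).
    t1: loose distance threshold; t2: tight threshold (t1 > t2).
    Returns a list of canopies, each a list of points; a point
    may appear in several canopies (canopies are not disjoint).
    """
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    s = list(points)
    canopies = []
    while s:
        # Select a random point c, remove it, and start a new canopy.
        c = s.pop(random.randrange(len(s)))
        canopy = [c]
        remaining = []
        for x in s:
            d = dist(x, c)
            if d < t1:
                canopy.append(x)      # within loose distance: join canopy
            if d >= t2:
                remaining.append(x)   # only tightly-bound points leave S
        s = remaining
        canopies.append(canopy)
    return canopies
```

Note that a point within T1 but not within T2 of the centre stays in S, so it can later join further canopies; this is what makes the canopies overlap.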