Poster: Parallel Algorithms for Clustering and Nearest Neighbor Search Problems in High Dimensions

Logan Moon, Institute for Computational Engineering and Sciences, The University of Texas at Austin, Austin, TX, logan@ices.utexas.edu
Daniel Long, School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, dlong3@gatech.edu
Shreyas Joshi, School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, shreyasj@gatech.edu
Vyomkesh Tripathi, School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, vtripathi7@gatech.edu
Bo Xiao, School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, boxiao33@gmail.com
George Biros, Institute for Computational Engineering and Sciences, The University of Texas at Austin, Austin, TX, gbiros@acm.org

ABSTRACT
Clustering and nearest neighbor searches in high dimensions are fundamental components of computational geometry, computational statistics, and pattern recognition. Despite the widespread need to analyze massive datasets, no MPI-based implementations are available to allow this analysis to be scaled to modern highly parallel platforms. We seek to develop a set of algorithms that will provide unprecedented scalability and performance for these fundamental problems.

Categories and Subject Descriptors: E.1 [Data Structures]: Distributed Data Structures; G.4 [Mathematical Software]: Algorithm Design and Analysis; G.4 [Mathematical Software]: Parallel and Vector Implementations

General Terms: Algorithms

1. INTRODUCTION
In this poster, we describe message-passing and shared-memory parallel algorithms for nearest-neighbor search and clustering in Euclidean spaces. Specifically, given a set R of n reference points {r_i}_{i=1}^n ⊂ R^d and a set Q of m query points {q_j}_{j=1}^m ⊂ R^d, and defining d(r_i, q_j) := ||r_i − q_j||_2, we consider the following two problems.
• (1) kmeans clustering (KMC): Find centers {c_j}_{j=1}^k in R^d that minimize sum_{j=1}^k sum_{r_i ∈ V_j} d(r_i, c_j)^2, where V_j is the Voronoi set of c_j, V_j = {r_i ∈ R : c_j = argmin_{c_l} d(r_i, c_l)}.

• (2) k- and range nearest neighbor searches (KNN): Given a query point q ∈ Q, find the k nearest neighbors of q in R, or find all points r ∈ R such that d(r, q) < ρ, where ρ is the range.

Copyright is held by the author/owner(s). SC'11 Companion, November 12–18, 2011, Seattle, Washington, USA. ACM 978-1-4503-1030-7/11/11.

Significance. Nearest neighbor and clustering problems are fundamental problems in computational geometry. They are building blocks for more complex algorithms in computational statistics (e.g., kernel density estimation), spatial statistics (e.g., n-point correlation functions), and machine learning (e.g., classification, manifold learning). In turn, such methods are key components in physics (high-dimensional and generalized N-body problems), dimension reduction for scientific datasets, and uncertainty estimation. Although KNN and KMC methods are fundamental building blocks for many algorithms in computational data analysis, little work has been done on scaling them to high-performance parallel platforms. There is a rich literature in distributed memory algorithms, particularly in the database community, but the technologies have not yet been migrated to high performance computing platforms.

Related work. For a comprehensive review on clustering methods and computational geometry, see [5] and, for parallel kmeans, see [8]. In [7], the authors present an efficient kmeans algorithm based on a KD-tree data structure; the method was not parallelized. The basic scheme for parallelizing kmeans is simple and is discussed in [3] and [6].
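As a point of reference for the two problem definitions above, the serial baselines are straightforward. The sketch below is hypothetical illustrative code, not the parallel implementation described in this poster: it shows Lloyd's iteration for the KMC objective and brute-force k-nearest and range searches under d(r, q) = ||r − q||_2. All function and variable names are our own.

```python
import numpy as np

def kmeans(R, k, iters=50, seed=0):
    """Lloyd's iteration: minimize sum_j sum_{r_i in V_j} d(r_i, c_j)^2."""
    rng = np.random.default_rng(seed)
    # Initialize centers from k distinct reference points.
    C = R[rng.choice(len(R), size=k, replace=False)].astype(float)
    labels = np.zeros(len(R), dtype=int)
    for _ in range(iters):
        # Assign each reference point to its Voronoi set V_j.
        dist = np.linalg.norm(R[:, None, :] - C[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Recompute each center as the mean of its Voronoi set.
        for j in range(k):
            if np.any(labels == j):
                C[j] = R[labels == j].mean(axis=0)
    return C, labels

def knn(R, q, k):
    """Indices of the k nearest reference points to query q."""
    d = np.linalg.norm(R - q, axis=1)
    return np.argsort(d)[:k]

def range_search(R, q, rho):
    """Indices of all r in R with d(r, q) < rho."""
    d = np.linalg.norm(R - q, axis=1)
    return np.nonzero(d < rho)[0]
```

The parallel algorithms in Section 2 target exactly these problems; the brute-force searches above cost O(nd) per query, which is what tree-based pruning and distributed partitioning aim to reduce.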
Although there is a significant amount of work on indexing structures in the database community (e.g., [2]), and on sequential algorithms [4, 9] for generalized N-body problems and tree data structures, we are not aware of any message passing interface-based scalable algorithms for multidimensional trees. Overall, for both kmeans and k-NN problems, previous work has been limited to shared memory implementations and to distributed memory runs with only a small number of processes.

2. METHODS
We have implemented several scalable methods for solving both the k-nearest neighbors and ρ-near neighbors problems. Here, we give an overview of the algorithms and their parallel implementations.

2.1 PCL-Tree Multilevel Partitioning
PCL-Tree is a fully distributed spatial indexing data structure that uses kmeans clustering to recursively partition a data set across MPI processes. Of course, the PCL-