Combining Distributed Memory and Shared Memory Parallelization for Data Mining Algorithms

Ruoming Jin
Department of Computer and Information Sciences
Ohio State University, Columbus OH 43210
jinr@cis.ohio-state.edu

Gagan Agrawal
Department of Computer and Information Sciences
Ohio State University, Columbus OH 43210
agrawal@cis.ohio-state.edu

ABSTRACT

In this paper, we focus on using a cluster of SMPs for scalable data mining. We have developed distributed memory and shared memory parallelization techniques that are applicable to a number of common data mining algorithms. These techniques are incorporated in a middleware called FREERIDE (FRamework for Rapid Implementations of Datamining Engines). We present experimental evaluation of our techniques and framework using apriori association mining, k-means clustering, and a decision tree algorithm. We achieve excellent speedups for apriori and k-means, and good distributed memory speedup for decision tree construction. Despite using a common set of techniques and a middleware with a high-level interface, the speedups we achieve compare well against the reported performance from stand-alone implementations of individual parallel data mining algorithms. Overall, our work shows that a common framework can be used for efficiently parallelizing algorithms for different data mining tasks.

1. INTRODUCTION

In recent years, clusters of SMPs have emerged as a cost-effective, flexible, and popular parallel processing configuration. Clusters of SMPs offer both distributed memory parallelism (across the nodes of the cluster) and shared memory parallelism (within a node). This imposes an additional challenge in parallelizing any class of applications on these systems. In this paper, we focus on using clusters of SMPs for data mining tasks. Our contributions are threefold.
First, we have developed a set of techniques for distributed memory as well as shared memory parallelization that apply across a number of popular data mining algorithms. Second, we have incorporated these techniques in a middleware which offers a high-level programming interface to application developers. Our middleware is called FREERIDE (FRamework for Rapid Implementation of Datamining Engines). Third, we present a detailed evaluation of our techniques and framework using three popular data mining algorithms: apriori association mining, k-means clustering, and a decision tree construction algorithm.

Our work is based on the observation that a number of popular data mining algorithms, including apriori association mining [2], k-means clustering [12], and decision tree classifiers [15], share a relatively similar structure. Their common processing structure is essentially that of generalized reductions. The computation on each node involves reading the data instances in an arbitrary order, processing each data instance, and updating elements of a reduction object using associative and commutative operators.

In a distributed memory setting, such algorithms can be parallelized by dividing the data items (or records or transactions) among the nodes and replicating the reduction object. Each node can process the data items it owns to perform a local reduction. After the local reduction on all nodes, a global reduction can be performed. In a shared memory setting, parallelization can be done by assigning different data items to different threads. The main challenge in maintaining correctness is avoiding race conditions when different threads may be trying to update the same element of the reduction object. We have developed a number of techniques for avoiding such race conditions.

(This work was supported by NSF grant ACR-9982087, NSF CAREER award ACR-9733520, and NSF grant ACR-0130437.)
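The generalized reduction structure described above can be illustrated with a small sketch. This is not FREERIDE's actual interface; it is a hypothetical Python example, assuming a simple counting reduction, that shows how replicating the reduction object (one private copy per worker) and merging the copies in a global reduction avoids race conditions entirely.

```python
# Illustrative sketch of a generalized reduction (not FREERIDE's API).
# Each worker reduces its partition of the data into a private copy of the
# reduction object; the copies are merged afterwards, so no two workers
# ever update the same object concurrently.
from multiprocessing import Pool

def local_reduction(chunk):
    # Process each data instance, updating a private reduction object
    # with an associative/commutative operator (here: addition of counts).
    reduction_object = {}
    for item in chunk:
        reduction_object[item] = reduction_object.get(item, 0) + 1
    return reduction_object

def global_reduction(partials):
    # Merge the replicated reduction objects into the final result.
    merged = {}
    for partial in partials:
        for key, value in partial.items():
            merged[key] = merged.get(key, 0) + value
    return merged

if __name__ == "__main__":
    data = ["a", "b", "a", "c", "b", "a", "c", "c"]
    chunks = [data[0::2], data[1::2]]      # partition the data among workers
    with Pool(2) as pool:
        partials = pool.map(local_reduction, chunks)
    print(global_reduction(partials))      # {'a': 3, 'b': 2, 'c': 3}
```

Because the operator is associative and commutative, the merged result is independent of how the data is partitioned or in what order instances are processed, which is what makes both the distributed memory and shared memory versions of this scheme correct.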
Our middleware incorporates techniques for both distributed memory and shared memory parallelization and offers a high-level programming interface.

We present a detailed experimental evaluation of our techniques and the framework by parallelizing three popular data mining algorithms: apriori association mining, k-means clustering, and a decision tree construction algorithm. We achieve excellent speedups for apriori and k-means, and good distributed memory speedup for decision tree construction. Despite using a common set of techniques and a middleware with a high-level interface, the speedups we achieve compare well against the reported performance from stand-alone implementations of individual parallel data mining algorithms. Our work shows that a common framework can be used for efficiently parallelizing algorithms for different data mining tasks. Moreover, we have also demonstrated that clusters of SMPs are well suited for the execution of mining algorithms.

The rest of this paper is organized as follows. Section 2 reviews parallel versions of several common data mining techniques. Techniques for both shared memory and distributed memory parallelization are presented in Section 3. Experimental results are presented in Section 4. We compare our work with related research efforts in Section 5 and conclude in Section 6.

2. PARALLEL DATA MINING ALGORITHMS

In this section, we describe how several commonly used data mining techniques can be parallelized in a very similar way. Our discussion focuses on three important techniques: apriori association mining [2], k-means clustering [12], and a decision tree construction algorithm [7].

2.1 Apriori Association Mining

Association rule mining is the process of analyzing a set of transactions to extract association rules and is a very commonly used and well-studied data mining problem [3, 23].
Given a set of transactions (each of them being a set of items), an association rule is an expression X ⇒ Y, where X and Y are sets of items. Such