Comparing Map-Reduce and FREERIDE for Data-Intensive Applications

Wei Jiang    Vignesh T. Ravi    Gagan Agrawal
Department of Computer Science and Engineering
The Ohio State University, Columbus, OH 43210
{jiangwei,raviv,agrawal}@cse.ohio-state.edu

Abstract—Map-reduce has been a topic of much interest in the last 2-3 years. While it is well accepted that the map-reduce APIs enable significantly easier programming, the performance aspects of the use of map-reduce are less well understood. This paper focuses on comparing the map-reduce paradigm with a system that was developed earlier at Ohio State, FREERIDE (FRamework for Rapid Implementation of Datamining Engines). The API and the functionality offered by FREERIDE have many similarities with the map-reduce API, though there are some differences. Moreover, while FREERIDE was motivated by data mining computations, map-reduce was motivated by searching, sorting, and related applications in a data-center. We compare the programming APIs and performance of the Hadoop implementation of map-reduce with FREERIDE. For our study, we have taken three data mining algorithms: k-means clustering, apriori association mining, and k-nearest neighbor search. We have also included a simple data scanning application, word-count. The main observations from our results are as follows. For the three data mining applications we have considered, FREERIDE outperformed Hadoop by a factor of 5 or more. For word-count, Hadoop is better by a factor of up to 2. With increasing dataset sizes, the relative performance of Hadoop improves. Overall, it seems that Hadoop has significant overheads related to initialization, I/O, and sorting of (key, value) pairs. Thus, despite an easy-to-program API, Hadoop's map-reduce does not appear very suitable for data mining computations on modest-sized datasets.

I. INTRODUCTION

The availability of large datasets and the increasing importance of data analysis in commercial and scientific domains are creating a new class of high-end applications. Recently, the term Data-Intensive SuperComputing (DISC) has been gaining popularity [1]; it covers applications that perform large-scale computations over massive datasets. The growing importance of data-intensive computing is closely coupled with the emergence and popularity of the map-reduce paradigm [2]. Implementations of this paradigm provide high-level APIs and runtime support for developing and executing applications that process large-scale datasets.

Map-reduce has been a topic of growing interest in the last 2-3 years. On one hand, multiple projects have focused on improving the API or its implementations [3], [4], [5], [6]. On the other hand, many projects are underway that focus on the use of map-reduce implementations for data-intensive computations in a variety of domains. For example, early in 2009, NSF funded several projects for using the Google-IBM cluster and the Hadoop implementation of map-reduce for a variety of applications, including graph mining, genome sequencing, machine translation, analysis of mesh data, text mining, image analysis, and astronomy¹.

In evaluating any parallel programming system, two important considerations are productivity and performance. With respect to productivity, it is well accepted that the map-reduce APIs enable significantly easier programming of the applications that are suited to them. Earlier alternatives for programming this class of applications on a cluster would have involved the use of MPI and explicit management of large datasets, which is clearly much more time-consuming. However, the performance aspects of map-reduce are less well understood. Some studies have shown speedups with increasing numbers of nodes [3], [7], [8], but most of them do not report any comparison with a different system.
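The map-reduce programming model discussed above can be illustrated with the canonical word-count application, which is also one of the benchmarks used later in this paper. The sketch below is an illustrative, single-process reconstruction in Python, not Hadoop's actual Java API; the function names and the explicit grouping step (which stands in for Hadoop's shuffle and sort of (key, value) pairs) are our own:

```python
from collections import defaultdict

def map_func(_, line):
    # Map phase: emit a (word, 1) pair for every word in an input line.
    for word in line.split():
        yield (word, 1)

def reduce_func(word, counts):
    # Reduce phase: sum all partial counts for one word.
    return (word, sum(counts))

def run_mapreduce(records, map_func, reduce_func):
    # Group intermediate values by key; in Hadoop this grouping is
    # performed by the runtime's shuffle-and-sort stage.
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_func(key, value):
            groups[k].append(v)
    return [reduce_func(k, vs) for k, vs in sorted(groups.items())]

lines = [(0, "the quick brown fox"), (1, "the lazy dog")]
print(run_mapreduce(lines, map_func, reduce_func))
# → [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```

Note that the user writes only the two small functions; partitioning of the input, grouping of intermediate pairs, and scheduling are the runtime's responsibility, which is the source of both the model's programmability and, as this paper examines, much of its overhead.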
The class of data-intensive applications is also very broad, ranging from applications that perform simple searches over massive datasets to machine learning and imaging on large datasets. The latter can involve a significant amount of computation, besides the need for managing and processing a large dataset. Thus, it is clearly important to understand the relative performance of map-reduce and its implementations for different sub-classes of data-intensive applications.

This paper focuses on comparing the map-reduce paradigm with a system that was developed earlier at Ohio State. FREERIDE (FRamework for Rapid Implementation of Datamining Engines) [9], [10], [11] was motivated by the difficulties in implementing and performance-tuning parallel versions of data mining algorithms. FREERIDE is based upon the observation that parallel versions of several well-known data mining techniques share a relatively similar structure. We carefully studied parallel versions of apriori association mining [12], Bayesian networks for classification [13], k-means clustering [14], k-nearest neighbor classifiers [15], artificial neural networks [15], and decision tree classifiers [16]. In each of these methods, parallelization in a distributed memory setting can be done by dividing the data instances (or records or transactions) among the nodes. The computation on each node involves reading the data instances in an arbitrary order, processing each data instance, and performing a local reduction. The reduction involves only commutative and associative operations, which means the result is independent of the order in which the data instances are processed. After the local reduction on each node, a global reduction is performed. FREERIDE exploits this commonality to support a high-level interface

¹ http://www.networkworld.com/community/node/27219
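The generalized reduction structure just described can be sketched in code. The example below is our own illustrative reconstruction in Python, not FREERIDE's actual C++ interface; all function and variable names are hypothetical. It uses a k-means-style iteration: each "node" folds its share of the data instances into a local reduction object (per-centroid sums and counts, which are commutative and associative, so instance order does not matter), and a global reduction then combines the per-node objects:

```python
def local_reduction(data_chunk, centroids):
    # Local reduction on one node: process each data instance and fold it
    # into per-centroid sums and counts. Because the updates are
    # commutative and associative, instances may be read in any order.
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for point in data_chunk:
        # Assign the instance to its nearest centroid (squared distance).
        j = min(range(k),
                key=lambda c: sum((point[d] - centroids[c][d]) ** 2
                                  for d in range(dim)))
        for d in range(dim):
            sums[j][d] += point[d]
        counts[j] += 1
    return sums, counts

def global_reduction(partials, k, dim):
    # Global reduction: merge the per-node reduction objects and
    # derive the new centroids.
    total_sums = [[0.0] * dim for _ in range(k)]
    total_counts = [0] * k
    for sums, counts in partials:
        for j in range(k):
            total_counts[j] += counts[j]
            for d in range(dim):
                total_sums[j][d] += sums[j][d]
    return [[total_sums[j][d] / max(total_counts[j], 1) for d in range(dim)]
            for j in range(k)]

# Divide the data instances among "nodes", reduce locally, then globally.
data = [[0.0, 0.0], [0.2, 0.1], [4.0, 4.0], [4.1, 3.9]]
chunks = [data[:2], data[2:]]
centroids = [[0.0, 0.0], [4.0, 4.0]]
partials = [local_reduction(chunk, centroids) for chunk in chunks]
print(global_reduction(partials, 2, 2))
```

In this structure, the intermediate state is a small, fixed-size reduction object per node rather than a stream of (key, value) pairs, which hints at why such a system can avoid the sorting and I/O overheads the paper later measures in Hadoop.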