Performance Analysis of Twister based MapReduce Applications on Virtualization System in FutureGrid

*Yunhee Kang
*Division of Information and Communication, Baekseok University, 115 Anseo Dong, Cheonan, Korea 330-704
yunh.kang@gmail.com

Geoffrey C. Fox
Pervasive Technology Institute, Indiana University, 2719 E 10th St., Bloomington, IN 47408, USA
gcf@indiana.edu

Abstract

This paper presents the results of a performance evaluation of Twister-based MapReduce applications running on a virtualized cluster system in FutureGrid. For this work, we set up a virtualized cluster system made up of a set of VM instances. In the experiment we observe that the overall performance of a data-intensive application is strongly affected by the configuration of the VMs, and that load balancing is essential for handling large files when computing resources in FutureGrid are limited. The results can be used to select a set of VM instances according to 1) how much data is processed and 2) what type of application runs in FutureGrid. They can also be used to identify the bottleneck of a MapReduce application running on a virtualized cluster system with various VM instances.

Keywords: MapReduce application, Cloud computing, Performance evaluation, FutureGrid

1. INTRODUCTION

Cloud computing is designed to provide on-demand computing resources or services over the Internet with the reliability level of a data center [1]. Conceptually, users acquire computing platforms or IT infrastructure from clouds and then run their applications inside those clouds [12]. Virtualization technologies partition hardware and consolidate server workloads, which reduces the number of physical servers needed and improves scalability and elasticity as the workload in the cloud environment changes. Hence virtualization technologies are the basis of cloud computing.
In a cloud computing environment, a virtual machine (VM) is a computing platform that creates a virtualized layer between the computing hardware and the application. FutureGrid, a science cloud, acts as a resource provider with a cloud stack that includes Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS) [2]. The demanding requirements in FutureGrid have led to the development of new programming models such as MapReduce. A MapReduce application is deployed on the provider's computing infrastructure, and its runtime supplies developers with a programming environment that has a set of well-defined APIs, which is referred to as PaaS.

This paper describes the results of a performance evaluation of two kinds of MapReduce applications: a data-intensive application and a computation-intensive application. For this work, we construct a virtualized cluster system whose nodes are VM instances in FutureGrid [2]. We report that determining the configuration of a virtualized cluster system is important for running a MapReduce application efficiently in FutureGrid. We also show that an appropriate selection of VM instance types increases the overall utilization of resources in FutureGrid. This approach is a way to identify the relationship between the type of application and the resources allocated to run it.

The rest of the paper is organized as follows: Section 2 describes related work, including a brief overview of MapReduce, Twister, and FutureGrid. Section 3 reports the results of our experimental study and our observations on them. Conclusions are presented in Section 4.

2. RELATED WORKS

2.1 MapReduce

MapReduce, introduced by Dean and Ghemawat, is the dominant programming model for developing applications in cloud computing environments [3-5].
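To make the model concrete, the following is a minimal single-process sketch in Python of the three phases a MapReduce runtime performs (map, group by key, reduce), using word count as the example. It is not Twister code and does not use any MapReduce library; the names `map_fn`, `reduce_fn`, `run_mapreduce`, and the sample documents are assumptions introduced only for illustration.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: one (key1, value1) pair in, a list of (key2, value2) pairs out."""
    # Illustrative assumption: key1 is a document id, value1 is its text.
    return [(word.lower(), 1) for word in text.split()]


def reduce_fn(word, counts):
    """Reduce: one (key2, list(value2)) group in, a list of (key3, value3) out."""
    return [(word, sum(counts))]


def run_mapreduce(inputs, mapper, reducer):
    """Sequential stand-in for a MapReduce runtime (hypothetical helper)."""
    # Map phase: each input pair is processed independently of the others.
    intermediate = []
    for key1, value1 in inputs:
        intermediate.extend(mapper(key1, value1))

    # Grouping phase: collect all intermediate values that share a key.
    groups = defaultdict(list)
    for key2, value2 in intermediate:
        groups[key2].append(value2)

    # Reduce phase: each group is reduced on its own; keys are sorted
    # only to make the output order deterministic.
    output = []
    for key2 in sorted(groups):
        output.extend(reducer(key2, groups[key2]))
    return output


documents = [("doc1", "the cloud runs the job"), ("doc2", "the job ends")]
word_counts = run_mapreduce(documents, map_fn, reduce_fn)
```

With the two sample documents, `word_counts` holds one (word, count) pair per distinct word, ordered by word. Because each `map_fn` call reads only its own input pair, a distributed runtime is free to execute the map calls on different machines; this sketch simply runs them in a loop.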
MapReduce is a data-parallel programming model aimed at loosely coupled computations that execute over given data sets [3]. This requires dividing the workload across a large number of machines. The degree of parallelism depends on the input data size. In this programming model the computation takes a set of input (key, value) pairs and produces a set of output (key, value) pairs. It consists of two functions, map and reduce, which have the following signatures:

Map :: (key1, value1) -> list((key2, value2))
Reduce :: (key2, list(value2)) -> list((key3, value3))

The map function processes each input pair (key1, value1) and returns a list of intermediate pairs (key2, value2). The intermediate pairs are then grouped by key. Each group is then processed by the reduce function, which outputs new pairs of the form (key3, value3). In a MapReduce application supported by a MapReduce library, all map operations can be executed independently. Each reduce operation may depend on the outputs generated by any