Applying Twister to Scientific Applications

Bingjing Zhang¹,², Yang Ruan¹,², Tak-Lon Wu¹,², Judy Qiu¹,², Adam Hughes², Geoffrey Fox¹,²
¹ School of Informatics and Computing, ² Pervasive Technology Institute
Indiana University Bloomington
{zhangbj, yangruan, taklwu, xqiu, adalhugh, gcf}@indiana.edu

Abstract

Many scientific applications suffer from the lack of a unified approach to managing execution and from inefficiency in processing large-scale data. The Twister MapReduce framework, which supports the traditional MapReduce programming model and extends it with iterations, aims to address these problems. This paper describes how Twister is applied to several kinds of scientific applications: BLAST, MDS Interpolation, and GTM Interpolation in non-iterative style, and MDS without interpolation in iterative style. The results show the applicability of Twister to data-parallel and EM algorithms with small overhead and increased efficiency.

Keywords

Twister, Iterative MapReduce, Cloud, Scientific Applications

1. Introduction

Scientific applications are required to process large amounts of data. Input data volumes have grown from gigabytes to terabytes and now even to petabytes, which already far exceeds the computing capability of a single computer. Although computing tasks can be parallelized across several computers, execution may still take days or weeks. This situation demands better parallel algorithms and distributed computing technologies that can manage scientific applications efficiently.

The MapReduce framework [1] is one such technology that has become popular in recent years. Key-value pairs allow the input to be distributed and processed in parallel at a fine granularity. The combination of Map tasks and Reduce tasks fits the task flow of most kinds of applications, and these tasks are well managed by the runtime platform.
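As an illustration of the key-value model described above, the following is a minimal single-process sketch of one Map phase followed by one Reduce phase, using word counting as the task. It is not Twister's API or the API of any other MapReduce runtime: the names map_task, reduce_task, and run_mapreduce are hypothetical, and a real runtime would distribute the records across many Map and Reduce tasks rather than loop over them sequentially.

```python
from collections import defaultdict


def map_task(_key, line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_task(word, counts):
    """Reduce: combine all values that were emitted for one key."""
    return word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Run one Map phase and one Reduce phase over (key, value) records.

    Hypothetical helper: everything runs sequentially in one process,
    whereas a real runtime would schedule the tasks on many nodes.
    """
    groups = defaultdict(list)
    for key, value in records:
        # Each record is independent, so it could go to a separate Map task.
        for out_key, out_value in mapper(key, value):
            # Shuffle step: group the intermediate values by key.
            groups[out_key].append(out_value)
    # Each key group is independent, so it could go to a separate Reduce task.
    return dict(reducer(k, v) for k, v in groups.items())


lines = ["map reduce map", "reduce twister"]
counts = run_mapreduce(enumerate(lines), map_task, reduce_task)
```

Because every input record and every key group is handled independently, the two loops are the points at which a runtime can parallelize the work.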
This paper introduces the Twister MapReduce framework [2], an extension of the traditional MapReduce framework. Its main characteristic is that it supports not only non-iterative MapReduce applications but also an iterative MapReduce programming model. This allows it to efficiently support Expectation-Maximization (EM) algorithms with complex communication patterns, which are common in scientific applications but are not supported by earlier MapReduce implementations such as Hadoop [3]. Twister uses a publish/subscribe messaging middleware system for command communication and data transfers. It supports MapReduce in the manner of “configure once, and run many times” [2]. Its APIs allow data to be scattered from the client node to the compute nodes and combined back on the client node. With these features, Twister supports iterative MapReduce computations efficiently compared to other MapReduce runtimes. Twister is also compatible with Cloud architecture and has been successfully deployed on the Amazon EC2 platform [4].

This paper mainly discusses the applicability of Twister. Through the implementation of several scientific applications, it shows how well these applications are supported by Twister. In the following sections, an overview of Twister is presented first, introducing its programming model and architecture. Then four Twister scientific applications are discussed. Three of them are non-iterative programs: Twister BLAST, Twister GTM Interpolation, and Twister MDS Interpolation. The final one, Twister MDS, is an iterative application. The workflows and parallel mechanisms supported by Twister are presented alongside these applications. The conclusion is drawn in the final section.

2. Twister Overview

This section gives an overview of the Twister MapReduce framework. The first part illustrates how the non-iterative and iterative MapReduce programming models are supported in Twister. The second part describes the architecture of Twister.

2.1.
Non-Iterative and Iterative MapReduce Support

Many parallel applications, such as WordCount [1], only need to perform Map and Reduce once. However, other applications, such as Kmeans [5] and PageRank [6], are inherently iterative. Their parallel algorithms require the program to perform Map and Reduce in iterations in order to obtain the final result. The basic idea of Twister is to configure a MapReduce job only once, then let it run in one turn or