MARISSA: MApReduce Implementation for Streaming Science Applications

E. Dede, Z. Fadika, J. Hartog, M. Govindaraju
SUNY Binghamton
Binghamton, NY 13902
Email: {edede1,zfadika,jhartog1,mgovinda}@cs.binghamton.edu

L. Ramakrishnan, D. Gunter, R. Canon
Lawrence Berkeley National Lab
Berkeley, CA 94720
Email: {lramakrishnan,dkgunter,scanon}@lbl.gov

Abstract: Since its inception, MapReduce has been steadily gaining ground in scientific disciplines ranging from space exploration to protein folding. However, the model poses challenges for a wide range of current and legacy scientific applications that need to address their "Big Data" problems. For example, MapReduce's best known implementation, Apache Hadoop, offers native support only for Java applications. While Hadoop streaming supports applications compiled in a variety of languages such as C, C++, Python and FORTRAN, streaming has been shown to be a less efficient MapReduce alternative in terms of performance and effectiveness. Additionally, Hadoop streaming offers fewer options than its native counterpart, and as such offers less flexibility and a limited array of features for scientific software. Moreover, the Hadoop Distributed File System (HDFS), a central pillar of Apache Hadoop, is not a POSIX-compliant file system. In this paper, we present an alternative framework to Hadoop streaming that addresses the needs of scientific applications: MARISSA (MApReduce Implementation for Streaming Science Applications). We describe MARISSA's design and explain how it expands the set of scientific applications that can benefit from the MapReduce model. We also compare and explain the performance gains of MARISSA over Hadoop streaming.

I. INTRODUCTION

Evolving scientific instruments and the rapid sophistication of computing systems have resulted in large-scale scientific simulations and data analysis workflows.
Today, scientists in a variety of disciplines such as earthquake simulation [32], bioinformatics [13], climate science [25], and astrophysics [9] generate data at larger scales than was previously possible. As more and more scientific data is generated, our ability to effectively manage and process such data also needs to evolve. MapReduce, since its introduction at the 6th USENIX Symposium on Operating Systems Design and Implementation [14], has been widely used to this end.

The MapReduce model is inspired by functional programming. It allows the uniform application of map and reduce functions to data that is split nearly equally amongst participating nodes. Its most attractive qualities include inherent data management, abstraction of parallelization and synchronization, and fault tolerance. For the scientist or the programmer, this means being absolved from providing parallelization and synchronization features to programs, as those features are automatically managed by the framework. Similarly, data management and fault tolerance (in case of node failures) are abstracted away from the user and are instead the responsibility of the MapReduce framework.

Apache Hadoop [1], the most widely used MapReduce framework, provides these same advantages. However, native Hadoop does not support application source code written in languages other than Java. While Hadoop streaming attempts to address this problem by enabling scripts and executable binaries to run on its framework, our previous work [19] has shown (see Table I and Figure 2 for summarized results) that streaming applications incur a considerable performance penalty. In this paper, we use the word streaming as in the context of Hadoop, which provides a mechanism to run non-Java applications from within the context of a Java-based MapReduce framework.
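The MapReduce model described above can be illustrated with a minimal, single-process sketch. It is not MARISSA's or Hadoop's implementation: the function names (`split_evenly`, `map_fn`, `reduce_fn`, `run_mapreduce`) are hypothetical, the "nodes" are simulated sequentially in one process, and word counting is chosen only as a familiar example workload.

```python
from collections import defaultdict


def split_evenly(records, num_nodes):
    """Split records into num_nodes contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(records), num_nodes)
    chunks, start = [], 0
    for node in range(num_nodes):
        size = base + (1 if node < extra else 0)
        chunks.append(records[start:start + size])
        start += size
    return chunks


def map_fn(line):
    """Example map function (assumption: word count): emit (word, 1) per word."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(key, values):
    """Example reduce function: sum all counts emitted for one key."""
    return key, sum(values)


def run_mapreduce(records, num_nodes, map_fn, reduce_fn):
    """Apply the same map and reduce functions uniformly over split data."""
    # Map phase: every (simulated) node applies map_fn to its own chunk.
    intermediate = []
    for chunk in split_evenly(records, num_nodes):
        for record in chunk:
            intermediate.extend(map_fn(record))

    # Shuffle phase: group intermediate values by key.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Reduce phase: apply reduce_fn once per key.
    return dict(reduce_fn(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    lines = ["map reduce map", "reduce data", "map"]
    print(run_mapreduce(lines, num_nodes=2, map_fn=map_fn, reduce_fn=reduce_fn))
```

The user supplies only `map_fn` and `reduce_fn`; splitting, grouping, and scheduling stay inside `run_mapreduce`, which mirrors the division of responsibility between the programmer and the framework. A real framework would additionally distribute the chunks across machines and handle node failures, which this sketch does not attempt.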
Hadoop MapReduce relies on the Hadoop Distributed File System (HDFS) [33], a non-POSIX-compliant file system, for its data and cluster operations. Supercomputing facilities such as the National Energy Research Scientific Computing Center (NERSC) [5], part of Lawrence Berkeley National Laboratory (LBNL), and scientific cluster computing centers such as TeraGrid [8] primarily rely on POSIX-compliant file systems. Thus, in scientific computing, file systems such as GFS2 [34], GPFS [31], and Lustre [3] are widely adopted rather than HDFS, which makes the adoption of MapReduce difficult and reduces its availability to scientists. Finally, Hadoop streaming in its current form, although capable of generic data-intensive computing, lacks the features most attractive to scientific applications.

In this paper, we present MARISSA (MApReduce Implementation for Streaming Science Applications), a MapReduce framework that offers better performance and faster application turnaround time than Hadoop streaming, while fully supporting a variety of POSIX-compliant file systems. The contributions of this paper are the following:
• We present the design and implementation of a MapReduce streaming framework capable of running not only Java applications, but also any executable binary.
• We provide evidence illustrating a considerable performance improvement over Hadoop streaming, both under normal conditions and under availability variations.