MARISSA: MApReduce Implementation for Streaming Science Applications

E. Dede, Z. Fadika, J. Hartog, M. Govindaraju
SUNY Binghamton, Binghamton, NY 13902
Email: {edede1,zfadika,jhartog1,mgovinda}@cs.binghamton.edu

L. Ramakrishnan, D. Gunter, R. Canon
Lawrence Berkeley National Lab, Berkeley, CA 94720
Email: {lramakrishnan,dkgunter,scanon}@lbl.gov

Abstract—Since its inception, MapReduce has steadily gained ground in scientific disciplines ranging from space exploration to protein folding. However, the model poses difficulties for a wide range of current and legacy scientific applications that need to address their "Big Data" challenges. For example, MapReduce's best-known implementation, Apache Hadoop, offers native support only for Java applications. While Hadoop streaming supports applications written in a variety of languages such as C, C++, Python, and FORTRAN, streaming has been shown to be a less efficient MapReduce alternative in terms of performance and effectiveness. Additionally, Hadoop streaming offers fewer options than its native counterpart, and as such offers less flexibility and a limited array of features for scientific software. Moreover, the Hadoop Distributed File System (HDFS), a central pillar of Apache Hadoop, is not a POSIX-compliant file system. In this paper, we present an alternative framework to Hadoop streaming that addresses the needs of scientific applications: MARISSA (MApReduce Implementation for Streaming Science Applications). We describe MARISSA's design and explain how it expands the set of scientific applications that can benefit from the MapReduce model. We also compare and explain the performance gains of MARISSA over Hadoop streaming.

I. INTRODUCTION

Evolving scientific instruments and the rapid sophistication of computing systems have resulted in large-scale scientific simulations and data analysis workflows.
Today, scientists in a variety of disciplines such as earthquake simulation [32], bioinformatics [13], climate science [25], and astrophysics [9] generate data at larger scales than was previously possible. As more and more scientific data is generated, our ability to effectively manage and process such data also needs to evolve. MapReduce, since its introduction at the 6th USENIX Symposium on Operating Systems Design and Implementation [14], has been widely used to this end.

The MapReduce model is inspired by functional programming. The model allows the uniform application of map and reduce functions to data that is split nearly equally among participating nodes. Among its most attractive qualities are inherent data management, abstraction of parallelization and synchronization, and fault tolerance. For the scientist or the programmer, this means being relieved of the need to build parallelization and synchronization into programs, as those features are managed automatically by the framework. Similarly, data management and fault tolerance (in case of node failures) are abstracted away from the user and are instead the responsibility of the MapReduce framework.

Apache Hadoop [1], the most widely used MapReduce framework, provides these same advantages. Native Hadoop, however, does not support applications whose source code is written in languages other than Java. While Hadoop streaming attempts to address this problem by enabling scripts and executable binaries to run on its framework, our previous work [19] has shown that streaming applications suffer a considerable negative performance impact (see Table I and Figure 2 for summarized results). In this paper, we use the word streaming in the Hadoop sense: a mechanism for running non-Java applications from within a Java-based MapReduce framework.
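To make the model described above concrete, the following is a minimal sketch of a streaming-style word count in Python. It is an illustration only, not code from Hadoop or MARISSA: the word-count task and the names `mapper`, `reducer`, and `run_job` are assumptions chosen for the example, and the whole job runs in a single process. As in Hadoop streaming, the map and reduce steps exchange plain tab-separated key/value records, and a sort between them stands in for the framework's shuffle phase.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map step: emit one "word<TAB>1" record per word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Reduce step: sum the values of consecutive records that share a key."""
    pairs = (record.split("\t", 1) for record in sorted_records)
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, sum(int(value) for _, value in group)


def run_job(lines, num_splits=2):
    """Hypothetical driver: split the input, map each split, sort, then reduce."""
    # Split the input nearly equally, as a framework would across nodes.
    splits = [lines[i::num_splits] for i in range(num_splits)]
    # Apply the same map function uniformly to every split.
    intermediate = [record for split in splits for record in mapper(split)]
    # Sorting groups identical keys together; it stands in for shuffle/sort.
    intermediate.sort()
    return dict(reducer(intermediate))


if __name__ == "__main__":
    print(run_job(["map reduce map", "reduce map"]))
```

In a real streaming deployment the mapper and reducer would be separate executables reading standard input and writing standard output, which is what allows them to be written in any language; the in-process driver here only mimics that data flow.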
Hadoop MapReduce relies on the Hadoop Distributed File System (HDFS) [33], a non-POSIX-compliant file system, for its data and cluster operations. Supercomputing facilities such as the National Energy Research Scientific Computing Center (NERSC) [5], part of Lawrence Berkeley National Laboratory (LBNL), and scientific cluster computing centers such as TeraGrid [8] primarily rely on POSIX-compliant file systems. Thus, in scientific computing, file systems such as GFS2 [34], GPFS [31], and Lustre [3] are widely adopted rather than HDFS, which makes the adoption of MapReduce difficult and reduces its availability to scientists. Finally, Hadoop streaming in its current form, although capable of generic data-intensive computing, lacks the features most attractive for scientific applications.

In this paper we present MARISSA (MApReduce Implementation for Streaming Science Applications), a MapReduce framework that offers better performance and faster application turnaround time than Hadoop streaming, while fully supporting a variety of POSIX-compliant file systems. The contributions of this paper are the following:
• We present the design and implementation of a MapReduce streaming framework capable of running not only Java applications, but also any executable binary.
• We provide evidence of a considerable performance improvement over Hadoop streaming, both under normal conditions and under availability variations.