Implementing a Publish-Subscribe Distributed Notification System on Hadoop

Jyotiska Nath Khasnabish, Ananda Prakash Verma, and Shrisha Rao

International Institute of Information Technology, Bangalore 560 100, India
{jyotiskanath.khasnabish,anandaprakash.verma}@iiitb.org, shrao@ieee.org

Abstract. Apache Hadoop is an open source framework for processing massive amounts of data in a distributed environment. Hadoop services use a polling mechanism for event notifications. In this paper, we propose a distributed notification system for Hadoop based on the Publish-Subscribe model. Such a notification system can be used for message-passing among Hadoop services. It can also be used to chain multiple MapReduce jobs based on events occurring in a Hadoop cluster. This results in reduced cluster load and network bandwidth requirements. We have used two popular Publish-Subscribe-based messaging systems, Apache ActiveMQ and Apache Kafka, for the implementation. Lastly, we have executed performance tests on both messaging systems to measure the time taken for message delivery and reception.

1 Introduction

In recent years, Hadoop [2] has become a de facto standard for processing massive amounts of data in a distributed environment. With the number of Hadoop-based services growing each year, the need for an event-based notification system has grown significantly. At the Hadoop Summit 2011 [13], Yahoo disclosed that its primary workflow manager, ‘Oozie’, manages over 600,000 processed jobs per month internally on its cluster, with more than 300 users in total. Yahoo predicted that the number of jobs would grow further in the coming years. Hadoop services such as MapReduce [10] computations produce a large number of jobs every hour. Often these services are run together with several other Hadoop services on a Hadoop cluster to perform complex data-intensive operations.
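The abstract contrasts polling with Publish-Subscribe notification and mentions chaining MapReduce jobs on cluster events. The idea can be illustrated with a minimal in-process sketch. Everything here is an assumption made for illustration: the `Broker` class, the topic names, and the job names are hypothetical, and the code does not use the ActiveMQ or Kafka APIs or reflect the implementation described in this paper.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


# Hypothetical in-process broker for illustration only; it is not the
# ActiveMQ or Kafka API and does no networking or persistence.
class Broker:
    def __init__(self) -> None:
        # Maps a topic name to the callbacks registered for it.
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[str], None]) -> None:
        """Register a Subscriber callback for a topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: str) -> int:
        """Deliver a message to every Subscriber of the topic; return the delivery count."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)


def demo() -> List[str]:
    broker = Broker()
    received: List[str] = []
    # Subscriber: a hypothetical downstream job that starts when an upstream job finishes.
    broker.subscribe("job.finished", lambda msg: received.append(f"start next job after {msg}"))
    # Publisher: the upstream job announces completion once; nobody polls for it.
    broker.publish("job.finished", "job-1")
    # No Subscriber is registered on this topic, so nothing is delivered.
    broker.publish("job.failed", "job-2")
    return received
```

In this sketch the downstream job is triggered by a single published event instead of repeatedly asking whether the upstream job has finished, which is the source of the load reduction the abstract describes. A real deployment would place the broker on the network between services.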
S.C. Satapathy et al. (eds.), ICT and Critical Infrastructure: Proceedings of the 48th Annual Convention of CSI - Volume I, Advances in Intelligent Systems and Computing 248, DOI: 10.1007/978-3-319-03107-1_60, © Springer International Publishing Switzerland 2014

We have designed and implemented a notification system on Hadoop using the Publish-Subscribe [11] model, which provides a high-performance and scalable solution for passing messages between different services. In this system, a node or service can play one of two roles: ‘Publisher’ or ‘Subscriber.’ The benefit of using the Publish-Subscribe model is that the Publishers are connected to the Subscribers through one or more message brokers rather