Cluster Comput
DOI 10.1007/s10586-017-1031-0

Making a case for the on-demand multiple distributed message queue system in a Hadoop cluster

Cao Ngoc Nguyen 1 · Soonwook Hwang 1 · Jik-Soo Kim 2

Received: 28 February 2017 / Revised: 20 May 2017 / Accepted: 3 July 2017
© Springer Science+Business Media, LLC 2017

Abstract In this paper, we present a framework that gives users a simple, convenient, and powerful way to deploy multiple message queue systems on demand in a Hadoop cluster. Specifically, we leverage Apache Kafka, one of the state-of-the-art distributed message queue systems, which can achieve high throughput, low latency, and good load balancing. Our framework automates setting up and starting Kafka brokers on the fly, so users can quickly adopt Kafka without spending much effort on installation and configuration. In addition, the framework lets users run their Kafka-based applications without detailed knowledge of the Hadoop YARN APIs and underlying mechanisms. We present a use case of the framework that evaluates Kafka's performance under various test cases and working scenarios. The experimental results allow potential Kafka users to see how different settings influence queuing performance.

Keywords Distributed message queue · Kafka · Hadoop · YARN · Many-task computing · MOHA

✉ Jik-Soo Kim
jiksoo@mju.ac.kr

Cao Ngoc Nguyen
cao@kisti.re.kr

Soonwook Hwang
hwang@kisti.re.kr

1 Korea Institute of Science and Technology Information, University of Science & Technology, Daejeon, Republic of Korea
2 Department of Computer Engineering, Myongji University, Yongin, Republic of Korea

1 Introduction

Distributed message queue systems enable us to build a large-scale distributed system by loosely coupling autonomic computational units. Based on message queue systems, independent computing elements can exchange information (e.g.
messages, tasks) asynchronously without tightly coupled integration efforts, which can improve the overall scalability and flexibility of the system. However, in a very large-scale system with many compute nodes, a message queue system can become a performance bottleneck or a single point of failure, which can limit the overall utilization and robustness of computational resources. Therefore, careful calibration and evaluation of message queue systems is an important research topic for large-scale distributed systems.

Message queue systems such as ZeroMQ [5], ActiveMQ [20], and RabbitMQ [18] have been actively used in many different middleware frameworks. Recently, Apache released a new open-source distributed message queue system called Kafka [8], which has been widely adopted by many enterprises such as LinkedIn, Yahoo, and Netflix. A study analyzing the capabilities of the most widely used middleware systems [13] showed that, in a particular use case, Kafka performs better than other solutions, with a higher message exchange rate, data integrity, and availability.

On the other hand, Hadoop [21] has become the de facto big data storage and processing infrastructure by leveraging a robust and scalable distributed file system (HDFS [19]) and an efficient parallel processing framework (MapReduce [3]). With the advent of Apache Hadoop YARN [22], the Hadoop platform is now evolving into a multi-use data platform that can support various types of data processing workflows. Based on these new features, we have worked on the design and implementation of a new data processing framework (called MOHA [7]) that can effectively