Cluster Comput
DOI 10.1007/s10586-017-1031-0
Making a case for the on-demand multiple distributed message
queue system in a Hadoop cluster
Cao Ngoc Nguyen 1 · Soonwook Hwang 1 · Jik-Soo Kim 2
Received: 28 February 2017 / Revised: 20 May 2017 / Accepted: 3 July 2017
© Springer Science+Business Media, LLC 2017
Abstract In this paper, we present a framework that gives users a
simple, convenient, and powerful way to deploy multiple message queue
systems on demand in a Hadoop cluster. Specifically, we leverage
Apache Kafka, a state-of-the-art distributed message queue system
that can achieve high throughput, low latency, and good load
balancing. Our framework automates setting up and starting Kafka
brokers on the fly, so users can adopt Kafka quickly without spending
much effort on installation and configuration. In addition, the
framework lets users run their Kafka-based applications without
detailed knowledge of the Hadoop YARN APIs and underlying mechanisms.
We present a use case of the framework that evaluates Kafka’s
performance under various test cases and working scenarios. The
experimental results help potential Kafka users understand how
different settings influence queuing performance.
Keywords Distributed message queue · Kafka · Hadoop ·
YARN · Many-task computing · MOHA
Corresponding author: Jik-Soo Kim
jiksoo@mju.ac.kr
Cao Ngoc Nguyen
cao@kisti.re.kr
Soonwook Hwang
hwang@kisti.re.kr
1 Korea Institute of Science and Technology Information,
University of Science & Technology, Daejeon, Republic of Korea
2 Department of Computer Engineering, Myongji University,
Yongin, Republic of Korea
1 Introduction
Distributed message queue systems enable us to build a large-scale
distributed system by loosely coupling autonomic computational units.
With a message queue system, independent computing elements can
exchange information (e.g. messages, tasks) asynchronously without
tightly coupled integration efforts, which can improve the overall
scalability and flexibility of the system. However, in a very
large-scale system with many compute nodes, a message queue system
can become a performance bottleneck or a single point of failure,
limiting the overall utilization and robustness of computational
resources. Careful calibration and evaluation of message queue
systems is therefore an important research topic for large-scale
distributed systems. Message queue systems such as ZeroMQ [5],
ActiveMQ [20], and RabbitMQ [18] have been actively used in many
different middleware frameworks. Recently, Apache released a new
open-source distributed message queue system called Kafka [8], which
has been widely adopted by enterprises such as LinkedIn, Yahoo, and
Netflix. A study analyzing the capabilities of the most widely used
middleware systems [13] showed that, in a particular use case, Kafka
performs better than other solutions, with a higher message exchange
rate, data integrity, and availability.
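The loose coupling described above can be illustrated with a minimal sketch. It uses Python's in-process `queue.Queue` only as a stand-in for a distributed message queue; it is not Kafka and not part of the framework presented in this paper. The names `producer`, `consumer`, and `run_demo` are hypothetical and chosen for illustration.

```python
import queue
import threading


def producer(q, tasks):
    """Publish tasks to the queue without knowing who consumes them."""
    for task in tasks:
        q.put(task)
    q.put(None)  # sentinel (an assumption of this sketch): no more tasks


def consumer(q, results):
    """Pull tasks from the queue at its own pace and process them."""
    while True:
        task = q.get()
        if task is None:
            break
        results.append(task.upper())  # placeholder for real work


def run_demo(tasks):
    """Run one producer and one consumer that share only the queue."""
    q = queue.Queue()
    results = []
    c = threading.Thread(target=consumer, args=(q, results))
    p = threading.Thread(target=producer, args=(q, tasks))
    c.start()
    p.start()
    p.join()
    c.join()
    return results


if __name__ == "__main__":
    print(run_demo(["task-a", "task-b", "task-c"]))
```

The producer and the consumer never call each other directly, so either side can be replaced or scaled independently. The single shared queue is also the component that can become a bottleneck or a single point of failure, which is the concern raised above.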
On the other hand, Hadoop [21] has become the de facto big data
storage and processing infrastructure by leveraging a robust and
scalable distributed file system (HDFS [19]) and an efficient
parallel processing framework (MapReduce [3]). With the advent of
Apache Hadoop YARN [22], the Hadoop platform is evolving into a
multi-use data platform that can support various types of data
processing workflows. Based on these new features, we have worked on
the design and implementation of a new data processing framework
(called MOHA [7]) that can effectively