2014 International Conference on Information, Communication Technology and System
978-1-4799-6858-9/14/$31.00 ©2014 IEEE ICTS 2014, Surabaya, Indonesia
Processing Performance on Apache Pig, Apache Hive
and MySQL Cluster
Ammar Fuad, Alva Erwin, Heru Purnomo Ipung
Information Technology, Swiss German University
Edutown BSD City, Tangerang 15339, Indonesia
1
ammar.fuad[at]student.sgu.ac.id
2
alva_erwin[at]yahoo.com
3
heru.ipung[at]sgu.ac.id
Abstract—MySQL Cluster is a famous clustered database that is
used to store and manipulate data. The problem with MySQL
Cluster is that as the data grows larger, the time required to
process the data increases and additional resources may be
needed. With Hadoop and Hive and Pig, processing time can be
faster than MySQL Cluster. In this paper, three data testers
with the same data model will run simple queries and to find out
at how many rows Hive or Pig is faster than MySQL Cluster.
The data model taken from GroupLens Research Project [12]
showed a result that Hive is the most appropriate for this data
model in a low-cost hardware environment.
Keywords— Hadoop; Hive; Pig; MySQL; MySQL Cluster;
Processing big data;
I. INTRODUCTION
Hadoop is a popular open-source implementation of
MapReduce that is used by academics, goverments, and
industrial organizations. Hadoop can be used for storing large
data and for processing data such as data mining, report
generation, file analysis, web indexing, and bioinformatic
research [2].
MySQL Cluster is a MySQL server with one or more data
storages and management servers to configure the cluster and
data replication. MySQL Cluster provides 99.999%
availability to the data. MySQL Cluster is designed for
distributed node architecture with no single point of failure. It
consists of multiple nodes that are distributed across machines
to make sure the system can work, even in case a node having
a problem such as network failure [11].
Apache Hive and Apache Pig are open source programs for
analyzing large data sets in a high-level language. Apache Pig
is a simple query algebra that lets the user declare data
transformation to files or groups of files. Hive is data
warehouse software that facilitates queries and manages a
large data set in distributed storage. Hive and Pig run on top
of Hadoop [5]-[9].
When it comes to querying large data sets on MySQL
Cluster, it can take seconds (assuming that the query is pretty
complex). As the data grows larger, the time required to
process the data increases too. This is where Hadoop fits in
with Hive and Pig.
This paper presents the processing time of Hive, Pig, and
MySQL Cluster on a simple data model with simple queries
while the data is growing. Section 3 discusses a proposed
method. Section 4 shows the results and explanations. And
the last section, section 5 provides a conclusion and possible
future work.
II. RELATED WORKS
Hive and Pig are a high-level language for processing data.
Both are used for working with petabyte scale data [5][9].
Working at low-scale data can also be done with Hive or Pig.
But processing low-scale data can consume more time with
Hive or Pig rather than using other data processing software
such as MySQL. As the data grows larger, MySQL requires
more time to process the data until it reaches a point where
Hive or Pig is faster than MySQL.
But when exactly do users need to change from MySQL to
Hive or Pig for a faster processing time? This research
indicates to users when they can switch to Hive or Pig as their
rows of data become bigger. This test is done in a low-cost
hardware environment.
III. PROPOSED METHOD
There are three aspects that will determine the result: 1) the
data set file size (how many rows); 2) query statements; 3)
query average time. There are three data sets with the same
data model. The first data set is called ml100k (movie lens
100,000 rows) containing a total of 102,580 rows. The second
data set is called ml1m containing a total of 1,075,611 rows.
The last data set is called ml10m containing a total of
10,069,372 rows.
A. Hadoop Environment
For the Hadoop environment, there is one Hadoop master,
three Hadoop slaves, one Sqoop, one Hive, and one Pig as
shown in Fig 1. Sqoop only pulls the data from the MySQL
Server and imports directly to Hadoop Distributed File
System (HDFS). This is only for ease of use of importing data
to HDFS. Data replication is set to 2 in the HDFS
configuration.
B. MySQL Cluster Environment
MySQL Cluster has one management node, four data
2014 International Conference on Information, Communication Technology and System
978-1-4799-6858-9/14/$31.00 © 2014 IEEE ICTS 2014, Surabaya, Indonesia 297