2014 International Conference on Information, Communication Technology and System 978-1-4799-6858-9/14/$31.00 ©2014 IEEE ICTS 2014, Surabaya, Indonesia Processing Performance on Apache Pig, Apache Hive and MySQL Cluster Ammar Fuad, Alva Erwin, Heru Purnomo Ipung Information Technology, Swiss German University Edutown BSD City, Tangerang 15339, Indonesia 1 ammar.fuad[at]student.sgu.ac.id 2 alva_erwin[at]yahoo.com 3 heru.ipung[at]sgu.ac.id Abstract—MySQL Cluster is a famous clustered database that is used to store and manipulate data. The problem with MySQL Cluster is that as the data grows larger, the time required to process the data increases and additional resources may be needed. With Hadoop and Hive and Pig, processing time can be faster than MySQL Cluster. In this paper, three data testers with the same data model will run simple queries and to find out at how many rows Hive or Pig is faster than MySQL Cluster. The data model taken from GroupLens Research Project [12] showed a result that Hive is the most appropriate for this data model in a low-cost hardware environment. Keywords— Hadoop; Hive; Pig; MySQL; MySQL Cluster; Processing big data; I. INTRODUCTION Hadoop is a popular open-source implementation of MapReduce that is used by academics, goverments, and industrial organizations. Hadoop can be used for storing large data and for processing data such as data mining, report generation, file analysis, web indexing, and bioinformatic research [2]. MySQL Cluster is a MySQL server with one or more data storages and management servers to configure the cluster and data replication. MySQL Cluster provides 99.999% availability to the data. MySQL Cluster is designed for distributed node architecture with no single point of failure. It consists of multiple nodes that are distributed across machines to make sure the system can work, even in case a node having a problem such as network failure [11]. Apache Hive and Apache Pig are open source programs for analyzing large data sets in a high-level language. Apache Pig is a simple query algebra that lets the user declare data transformation to files or groups of files. Hive is data warehouse software that facilitates queries and manages a large data set in distributed storage. Hive and Pig run on top of Hadoop [5]-[9]. When it comes to querying large data sets on MySQL Cluster, it can take seconds (assuming that the query is pretty complex). As the data grows larger, the time required to process the data increases too. This is where Hadoop fits in with Hive and Pig. This paper presents the processing time of Hive, Pig, and MySQL Cluster on a simple data model with simple queries while the data is growing. Section 3 discusses a proposed method. Section 4 shows the results and explanations. And the last section, section 5 provides a conclusion and possible future work. II. RELATED WORKS Hive and Pig are a high-level language for processing data. Both are used for working with petabyte scale data [5][9]. Working at low-scale data can also be done with Hive or Pig. But processing low-scale data can consume more time with Hive or Pig rather than using other data processing software such as MySQL. As the data grows larger, MySQL requires more time to process the data until it reaches a point where Hive or Pig is faster than MySQL. But when exactly do users need to change from MySQL to Hive or Pig for a faster processing time? This research indicates to users when they can switch to Hive or Pig as their rows of data become bigger. This test is done in a low-cost hardware environment. III. PROPOSED METHOD There are three aspects that will determine the result: 1) the data set file size (how many rows); 2) query statements; 3) query average time. There are three data sets with the same data model. The first data set is called ml100k (movie lens 100,000 rows) containing a total of 102,580 rows. The second data set is called ml1m containing a total of 1,075,611 rows. The last data set is called ml10m containing a total of 10,069,372 rows. A. Hadoop Environment For the Hadoop environment, there is one Hadoop master, three Hadoop slaves, one Sqoop, one Hive, and one Pig as shown in Fig 1. Sqoop only pulls the data from the MySQL Server and imports directly to Hadoop Distributed File System (HDFS). This is only for ease of use of importing data to HDFS. Data replication is set to 2 in the HDFS configuration. B. MySQL Cluster Environment MySQL Cluster has one management node, four data 2014 International Conference on Information, Communication Technology and System 978-1-4799-6858-9/14/$31.00 © 2014 IEEE ICTS 2014, Surabaya, Indonesia 297