International Journal of Software Engineering and Its Applications
Vol. 10, No. 6 (2016), pp. 93-100
http://dx.doi.org/10.14257/ijseia.2016.10.6.08
ISSN: 1738-9984 IJSEIA
Copyright ⓒ 2016 SERSC

Performance Comparison of MySQL Cluster and Apache Spark for Big Data Applications

Indira Bidari 1, Sindhooja K 2, Satyadhyan Chickerur 3

1 Department of Information Science and Engineering, B V Bhoomaraddi College of Engineering and Technology, Hubballi, Karnataka, India
2 Applied Materials Pvt. Ltd, Benguluru, India
3 Centre for High Performance Computing, K L E Technological University, Hubballi, Karnataka, India
chickerursr@kletech.ac.in

Abstract

Working with data involves two major tasks: storing the data and performing computations on it. MySQL is a widely used database management system that provides an effective and efficient method for data storage and computation. However, with the huge amount of data generated every day in various fields, the need for advanced methods of managing and analyzing big data is evident. One such platform, developed specifically for big data analytics, is Apache Spark. Although MySQL is preferred for small amounts of data and Spark is meant for big data, many of their functionalities are similar, so the two can be considered for a comparative study. In this work, we executed a set of queries with common functionality on the same dataset on both frameworks. The results are analyzed with the help of visualizations to arrive at appropriate conclusions.

Keywords: MySQL, RDD, Apache Spark

1. Introduction

MySQL is the world’s most popular open source database software, with over 100 million copies of its software downloaded or distributed throughout its history. With its superior speed, reliability, and ease of use, MySQL [1] has become a preferred choice for many applications involving data analytics.
Although MySQL remains one of the most popular relational database management systems in the world because of its ease of use, readily available support, open source features, and low cost, it has recently been losing supporters. Its disadvantages include stability issues, poor performance scaling, and development that is not community driven. The major one is database size: although it is theoretically scalable up to 8 TB, MySQL cannot work efficiently with large databases. This has led to the development of advanced platforms for managing huge amounts of data.

MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications on commodity clusters. However, most of these systems are built around an acyclic data flow model that is not suitable for other popular applications. Apache Spark was designed and implemented to address this limitation. The main abstraction in Spark [2] is the resilient distributed dataset (RDD), which represents a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Users can explicitly cache an RDD in memory across machines and reuse it in multiple MapReduce-like parallel operations. RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to rebuild just that partition. Although RDDs are not a general shared memory abstraction, they