Designing an Efficient Distributed Algorithm for Big Data Analytics: Issues and Challenges

Mohammed S. Al-kahtani, Dept. of Computer Engg., Prince Sattam bin Abdulaziz University, Saudi Arabia. Email: alkahtani@psau.edu.sa
Lutful Karim, School of ICT, Seneca College of Applied Arts & Technology, Toronto, Canada. Email: lutful.karim@senecacollege.ca

Abstract - As designing computationally efficient distributed algorithms is very important for analyzing big data, this paper presents the current state-of-the-art research in designing distributed algorithms for big data analytics. More specifically, this paper presents a comprehensive survey of existing distributed algorithms for big data analytics: it presents the working principles of existing algorithms and their advantages and limitations, and compares these algorithms in terms of several features. This paper also presents the issues and research challenges that arise in designing an efficient distributed algorithm for big data analytics, and proposes some solutions to address these challenges. Our research finds that these algorithms support parallel processing and are designed based on the MapReduce paradigm of big data, each for a particular application.

Keywords: Big Data, Distributed Algorithm, MapReduce, DBMS, Commodity Hardware.

I. INTRODUCTION

As large amounts of data are produced by social media, the cloud, computer networks, content delivery networks, and other emerging technologies and transmitted over the Internet, big data analytics has achieved widespread popularity. At the same time, analyzing big data has proven challenging because traditional computing systems are not able to handle it. Three attributes of big data, namely velocity, volume, and variety (the 3Vs), reflect these challenges. Velocity concerns how quickly large amounts of data arrive. Volume concerns the size and amount of data being stored and processed. Variety represents the different formats of data, which are mostly unstructured.
A good example of the 3V attributes of big data is Instagram, which has over 400 million active users with 80 million uploads per day on average. These uploads include multiple formats of pictures and videos [1]. To cope with the challenges big data presents, horizontal scaling of a computing system is much more important than vertical scaling [2]. This means that, rather than building one computer to be more powerful, the work is spread out over many less powerful machines. This necessitates the use of distributed algorithms to process big data. Hence, designing efficient distributed algorithms for processing big data is significantly important for achieving computational efficiency.

Several distributed algorithms [1-14] have been designed to process big data. These algorithms are mostly designed to process the data of a specific application, and they are designed based on the MapReduce paradigm. MapReduce runs on a cluster of commodity machines and thus can be used in Hadoop operations and functionalities for large-scale data processing. Among many, the Parallel IdeaGraph Algorithm uses the MapReduce paradigm to handle big data challenges by implementing a parallel distribution of IdeaGraph [2]. Another algorithm, parallel Probabilistic Latent Semantic Analysis (PLSA) [3], implements a parallel method to train PLSA under the MapReduce framework; this algorithm addresses the scalability issues of PLSA. The item-based collaborative-filtering algorithm [4] is an effective and computationally efficient algorithm in the MapReduce paradigm that uses "hotweight" as the weight. An evolutionary algorithm, the Grouping Genetic Algorithm, solves the problem of grouping and has been applied to schema optimization in HBase. In addition to these algorithms, the Locality-aware Scheduling Algorithm (LaSA) performs data-locality-aware assignment in the Hadoop scheduler to enhance the performance of big data applications [5].
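The horizontal-scaling idea described above, spreading work over many weaker machines rather than one powerful machine, can be sketched on a single host with worker threads standing in for commodity machines. This is a minimal illustrative sketch, not code from any of the surveyed systems; the function names, data, and worker count are assumptions for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Each worker (standing in for one commodity machine) handles one slice.
    return sum(chunk)

def horizontal_sum(data, workers=4):
    # Split the dataset into roughly equal chunks, one per worker.
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Run the partial computations on the workers in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    # Combine the partial results, as a coordinator node would.
    return sum(partials)

print(horizontal_sum(list(range(1, 101))))  # 5050
```

Adding capacity here means adding more workers (horizontal scaling) rather than making any single worker faster (vertical scaling), which is the trade-off the paragraph above describes.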
The Parallel Two-Pass MDL (PTP-MDL) algorithm, Scalable Nearest Neighbor, and Convex Optimization are other algorithms that reduce computational, storage, and communication bottlenecks [6]. However, these algorithms are application dependent and hence cannot be used for processing the data of all applications. These algorithms also work only on commodity machines and cannot be used on low-powered wireless nodes such as sensors.

This paper provides a comprehensive survey of distributed algorithms for big data: it presents the working principles of existing algorithms and their advantages and limitations, and compares these algorithms in terms of several features. This paper also presents several challenges and issues in designing an efficient distributed algorithm and proposes some solutions for future improvement.

The rest of the paper is organized as follows. Section 2 defines some terminology and presents the MapReduce paradigm. Section 3 presents a comprehensive review of distributed algorithms for big data. Section 4 analyzes the algorithms, then classifies and compares them. Section 5 identifies research challenges in designing an efficient distributed algorithm and proposes improvements. Section 6 summarizes our research work in this paper.

International Journal of Computer Science and Information Security (IJCSIS), Vol. 15, No. 11, November 2017, https://sites.google.com/site/ijcsis/, ISSN 1947-5500

II. PRELIMINARIES

To understand the working principles of distributed algorithms, the MapReduce framework of big data, which was popularized by Google, is presented first. MapReduce is a scalable and fault-tolerant data processing tool that processes a large amount of data in parallel.
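As a concrete illustration of the map, shuffle, and reduce phases of the MapReduce framework introduced above, the following minimal single-process sketch mimics the paradigm in plain Python. The word-count task and all function names are illustrative assumptions, not the API of Hadoop or of any surveyed algorithm:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit (key, value) pairs; here, (word, 1) for each word.
    return [(word, 1) for word in document.split()]

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: aggregate the values for each key.
    return {key: sum(values) for key, values in grouped.items()}

docs = ["big data big analytics", "data processing"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle_phase(pairs))
print(counts["big"], counts["data"])  # 2 2
```

In a real cluster the map and reduce calls run on different commodity machines and the shuffle moves intermediate pairs over the network; the scalability and fault tolerance noted above come from rerunning failed map or reduce tasks on other machines.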