IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 19, Issue 1, Ver. IV (Jan.-Feb. 2017), PP 114-121
www.iosrjournals.org
DOI: 10.9790/0661-190104114121

Hadoop Based Big Data Clustering using Genetic & K-Means Algorithm

Palak Sachar 1, Vikas Khullar 2
1 (Student of Masters of Technology in Computer Science and Engineering)
2 (Assistant Professor in Computer Science and Engineering)
CT Group of Institute, Jalandhar, India.

Abstract: This is the era of very large data sets, commonly called Big Data. Clustering plays several important roles in Big Data analytics. In this paper, we introduce a Big Data clustering algorithm that combines the Genetic and K-Means algorithms on the Hadoop framework. The major aim of this hybrid algorithm is to make the clustering process faster and to raise the accuracy of the resulting clusters.

Keywords: Big Data Analytics, Genetic Algorithm, Hadoop, K-Means, MapReduce

I. Introduction
The Genetic Algorithm is well known for optimization, and the K-Means algorithm is one of the best for data clustering. The former is able to obtain optimal results in a global search space, and the latter produces outcomes in less time [1, 2]. Paper [3] presents the working of a Genetic Algorithm for clustering data. It compares a parallel algorithm with a sequential process and shows that the Genetic Algorithm on the Hadoop platform outperforms the sequential one in both time and accuracy. These algorithms also have shortcomings: they require detailed prior knowledge of the various input parameters, and the K-Means clustering algorithm is prone to early convergence. Prior knowledge is essential to get the desired outcomes, and early convergence that does not meet the algorithm's requirements leads to poor results.
The Genetic and K-Means algorithms fit well together for avoiding these problems during clustering and for reaching an optimal solution in less time. At present, Big Data has drawn a range of techniques toward handling the larger Volume, higher Velocity, and greater Variety of generated data. In this paper, we propose, implement, and analyze a hybrid approach to optimized clustering that uses the Genetic and K-Means algorithms with the support of Hadoop MapReduce and the Mahout clustering libraries [3]. There are further benefits as well that justify this endeavor. Social media is a source of large volumes of genuine data on which we can verify our approach, so we chose to work with Twitter data, which is well suited to verifying such an approach. Such data is difficult to collect without a fast internet connection or considerable patience [4].

II. Background
a. Clustering Techniques
In Data Mining there are numerous clustering techniques. K-Means clustering is an important one because its computations revolve around user-defined centroids and complete quickly [1, 14]. Another important technique is hierarchical clustering, which builds tree-like structures. It divides a large cluster into smaller ones based on similarity until it reaches k clusters. Other clustering techniques are DBSCAN and OPTICS, which are based on the density of the data. OPTICS is superior to DBSCAN because it overcomes the limitation of assigning relevant data to clusters when parts of the data have low density [1, 10]. Apart from these, there are graph-theoretic clustering techniques, which compute the clustering of data through graphs. Model-based clustering techniques involve decision trees and networks. In 2001, grid clustering techniques such as fuzzy logic and an evolutionary algorithm were introduced.
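To make the centroid-based computation of K-Means described above concrete, the following is a minimal single-machine sketch in Python. It is not the Hadoop/Mahout implementation used in this paper. The function name `kmeans`, its parameters, and the choice to leave the centroid of an empty cluster unchanged are assumptions made for illustration only.

```python
import math
import random


def kmeans(points, k, init_centroids=None, max_iter=100, seed=0):
    """Plain K-Means on a list of equal-length numeric tuples (illustrative only)."""
    rng = random.Random(seed)
    # Assumption: centroids are either supplied by the user or
    # sampled from the data without replacement.
    centroids = list(init_centroids) if init_centroids else rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # Converged: no centroid moved.
        centroids = new_centroids
    return centroids, labels


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centers, assignment = kmeans(data, k=2, init_centroids=[(0, 0), (10, 10)])
    print(centers, assignment)
```

The result depends on the starting centroids, which is why K-Means is sensitive to the prior choice of input parameters noted in the Introduction.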
A 2015 study by Akilesh and colleagues showed that a particle swarm optimization algorithm outperforms the K-Means algorithm on the MapReduce framework [6]. Similarly, the Genetic Algorithm is a distinctive and growing technology in the research field [2]. A cellular genetic algorithm [7] was also used to cluster tweets, and its behaviour was observed on the Java platform [18]. In [3], an evolutionary algorithm, namely a classic Genetic Algorithm, was used to cluster Big Data on the Hadoop platform, which gave better execution time and accuracy than on the Java platform.
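As a rough illustration of how a Genetic Algorithm can be applied to clustering, the sketch below evolves candidate sets of k centroids and scores them by the sum of squared errors (SSE). It is not the method of [3] or the implementation used in this paper. The names `sse` and `ga_centroids`, the truncation selection, the uniform crossover, the mutation scheme, and every parameter value are assumptions chosen for brevity.

```python
import math
import random


def sse(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)


def ga_centroids(points, k, pop_size=20, generations=50,
                 mutation_rate=0.1, seed=0):
    """Evolve a list of k centroids; a lower SSE means a fitter chromosome."""
    rng = random.Random(seed)
    # Each chromosome is a list of k centroids drawn from the data.
    population = [rng.sample(points, k) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection (assumed): keep the fitter half of the population.
        population.sort(key=lambda ch: sse(points, ch))
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover: take each centroid from either parent.
            child = [rng.choice(pair) for pair in zip(a, b)]
            # Mutation: occasionally swap a centroid for a random data point.
            child = [rng.choice(points) if rng.random() < mutation_rate else c
                     for c in child]
            children.append(child)
        # Parents survive unchanged, so the best SSE never gets worse.
        population = parents + children
    return min(population, key=lambda ch: sse(points, ch))


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    best = ga_centroids(data, k=2)
    print(best, sse(data, best))
```

One plausible way to combine the two algorithms, stated here only as an assumption, is to pass the best chromosome to K-Means as its initial centroids, so that the global search of the Genetic Algorithm reduces the chance of K-Means converging early to a poor solution.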