DOI: http://dx.doi.org/10.26483/ijarcs.v9i1.5090
Volume 9, No. 1, January-February 2018
International Journal of Advanced Research in Computer Science
RESEARCH PAPER
Available Online at www.ijarcs.info
© 2015-19, IJARCS All Rights Reserved
ISSN No. 0976-5697
GENERATION OF A HYBRID CLUSTERING ALGORITHM FOR BIG DATA
Deepak Ahlawat
PhD Research Scholar, MMU, Sadopur
Ambala, Haryana, India
Dr. Deepali Gupta
HOD CSE, MMU, Sadopur
Ambala, Haryana, India
Abstract: In this paper, a hybrid algorithm for clustering big data is proposed, based on rank similarity. The rank similarity of two data objects is calculated as the sum of their cosine and Gaussian similarities. The proposed technique is compared with an existing technique based on cosine similarity alone. The comparison uses the parameters precision, recall, F-measure, and accuracy. Results are evaluated in Java NetBeans 8.2.
Keywords: Cosine Similarity, Gaussian Similarity, Rank Similarity.
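The abstract defines the rank similarity of two objects as the sum of their cosine and Gaussian similarities. A minimal Java sketch of that score follows; the Gaussian (RBF) form exp(-||a-b||²/(2σ²)) and the bandwidth parameter sigma are assumptions, since the paper does not give the formulas explicitly.

```java
// Sketch of the rank-similarity score described in the abstract:
// rank(a, b) = cosine(a, b) + gaussian(a, b).
// The Gaussian form and its bandwidth sigma are assumed here; the
// paper does not specify them.
public class RankSimilarity {

    // Cosine similarity: dot(a, b) / (||a|| * ||b||).
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Gaussian (RBF) similarity: exp(-||a - b||^2 / (2 * sigma^2)).
    static double gaussian(double[] a, double[] b, double sigma) {
        double d2 = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            d2 += diff * diff;
        }
        return Math.exp(-d2 / (2 * sigma * sigma));
    }

    // Rank similarity: the sum of the two scores above.
    static double rank(double[] a, double[] b, double sigma) {
        return cosine(a, b) + gaussian(a, b, sigma);
    }

    public static void main(String[] args) {
        double[] x = {1, 0, 1};
        double[] y = {1, 1, 0};
        // cosine = 0.5, gaussian = exp(-1), so rank ≈ 0.8679
        System.out.printf("%.4f%n", rank(x, y, 1.0));
    }
}
```

Because the two component scores lie in different ranges (cosine in [-1, 1], Gaussian in (0, 1]), the combined rank score rewards pairs that are close under both measures.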
1. INTRODUCTION
1.1 Big Data
As stated by IBM, with pervasive handheld devices, machine-to-machine communication, and online/mobile social networks, 2.5 quintillion bytes of data have been created every day over the last two years. It has become difficult for users to capture, store, manage, analyse, share, and visualize this data with conventional processing tools. The concept of big data was proposed in response.
The capability for data generation has never been as enormous and powerful as it has become since the development of information technology (IT) in the late 19th century. As an example, on October 4, 2012, the first presidential debate between President Barack Obama and Governor Mitt Romney triggered more than 10 million tweets within two hours; the discussion peaked at specific moments and revealed the public's interest in topics such as Medicare and vouchers. However, the term 'big data' is still vague. As described on Wikipedia, big data refers to data sets so large and complex that traditional data-processing applications are inadequate for processing them. A widely accepted definition belongs to IDC: 'big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis.' The widespread use of such large and exceptionally valuable data also increases the risks to security and privacy. For example, Amazon monitors users' shopping preferences; Facebook collects personal information as well as our social relationships; and mobile operators know not only to whom a person is talking but also that person's availability. The promising value accrues to those who can analyse the data, and the signs point to a further surge in the storage, re-use, and gathering of personal data by third parties. If the age of the Internet threatened security and privacy, the era of big data will endanger them further. Before examining what big data is in detail, consider the diagram below from Hewlett-Packard:
Fig. 1. Amount of Data Volume
1.2 Clustering
Grouping data into different sets, classes, or clusters is known as clustering. The data placed in one cluster are similar to the other data in that cluster and dissimilar to the data in other clusters. Dissimilarity can be calculated with respect to various attributes. There are various distance measures that describe the dissimilarity between data objects; these measures are then used to construct a dissimilarity matrix. Clustering is useful in various fields such as data mining, statistics, biology, and machine learning. Numerous clustering algorithms are discussed in the literature. Every algorithm has its own pros and cons, and each finds use in different situations [1].
Typically, clustering algorithms fall into the following categories:
1. Partitioning Methods.
2. Density-Based Methods.
3. Hierarchical Methods.
4. Grid-Based Methods.
5. Supervised and Unsupervised Learning Based
Methods.
In this paper, two important categories of methods (partitioning and density-based) are exploited to perform the clustering.
Partitioning Methods: Suppose there is a database containing n objects or data tuples, and the task is to divide these objects into k clusters, where k ≤ n. A partitioning method performs this clustering according to the dissimilarity between data objects: objects that are similar are placed in the same group, and objects that are dissimilar are placed in different groups. A clustering algorithm should meet some essential requirements: (1) no cluster may be empty, i.e., every cluster should contain at least one data object, and (2) no data object is shared among