International Research Journal of Engineering and Technology (IRJET) | e-ISSN: 2395-0056 | p-ISSN: 2395-0072
Volume: 03 Issue: 01 | Jan-2016 | www.irjet.net
© 2016, IRJET | Impact Factor value: 4.45 | ISO 9001:2008 Certified Journal | Page 1088

Clustering of Big Data Using Different Data-Mining Techniques

Manisha R. Thakare 1, Prof. S. W. Mohod 2, Prof. A. N. Thakare 3
1 M. Tech, Computer Science & Engineering, B.D. College of Engineering, Wardha, Maharashtra, India
2, 3 Assistant Professor, Computer Science & Engineering, B.D. College of Engineering, Wardha, Maharashtra, India

Abstract - Large amounts of heterogeneous digital data now exist. This paper examines the phenomenon of big data and the big data analytics that has emerged in response to it. Big data is large-volume, heterogeneous, distributed data. In big data applications, where data collection grows continuously, it is expensive to capture, manage, extract, and process data using existing software tools. Fast retrieval of relevant information from databases has always been a significant issue. Clustering is a main task of exploratory data analysis and of data mining applications. It is an unsupervised data mining technique that divides a dataset into groups.

Key Words: Data Mining, Clustering, Classification, Clustering Algorithms, Big Data, Map-Reduce.

1. INTRODUCTION

Big data is one of the biggest buzz phrases in the IT domain. New personal communication technologies are driving the big data trend, and the Internet population grows day by day. The need for big data arose from large companies such as Facebook, Yahoo, Google, and YouTube, which must analyze enormous amounts of unstructured data by converting it into structured form.
The need for big data analytics on data stored in relational database systems is described in terms of five parameters: volume, variety, velocity, value, and veracity.

Volume: Data of all types grows day by day, from kilobytes (KB) and megabytes (MB) through terabytes (TB) and petabytes (PB) to zettabytes (ZB) and yottabytes (YB) of information. The data results in large files. Excessive data volume is the main storage issue, and it is addressed by reducing storage cost. Data volumes are expected to grow 50 times by 2020.

Variety: Data sources are extremely heterogeneous. Files come in various formats and types, and may be structured or unstructured: text, audio, video, log files, and more.

Velocity: The data arrives at high speed. Sometimes one minute is too late, so big data is time sensitive. For some organizations, data velocity is the main challenge.

Value: Value is the main buzzword for big data because it matters to the business and to the IT infrastructure that stores large amounts of values in databases. It is the most important V in big data.

Veracity: This refers to the increased range of values typical of a large data set. When dealing with high volume, velocity, and variety, not all of the data will be 100% correct; there will be dirty data.

1.1 Data Mining Techniques

Data mining offers many types of techniques, such as clustering, classification, and neural networks, but this paper considers only two of them.

1.1.1 Clustering

Clustering is the most significant task of data mining. It is an unsupervised machine learning method. In clustering, the objects are divided into classes (clusters) according to their similarity, without predefined class labels. Two important topics are: (1) the different ways to group a set of objects into a set of clusters, and (2) the types of clusters. The result of cluster analysis is a number of heterogeneous groups with homogeneous contents. The first document or object of a cluster is defined as the initiator of that cluster. The initiator is called the cluster seed.

Fig. 1: Cluster Analysis
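
As a rough illustration of the cluster seed idea described above, the following Python sketch groups points in a single pass: the first object that opens a cluster becomes its seed, and each later object joins the cluster of the nearest seed if it is close enough. This is not presented as the method used in this paper; it is a simple single-pass scheme (often called leader clustering) chosen only for illustration. The function name seed_clustering, the use of Euclidean distance, the sample points, and the threshold value are all assumptions.

```python
import math


def seed_clustering(points, threshold):
    """Single-pass, seed-based clustering (hypothetical illustration).

    The first point that opens a cluster becomes its seed. Each later
    point joins the cluster of the nearest seed if that seed lies within
    `threshold` (Euclidean distance, an assumption); otherwise the point
    opens a new cluster and becomes that cluster's seed.
    """
    seeds = []     # seeds[i] is the initiator of cluster i
    clusters = []  # clusters[i] holds all members of cluster i
    for p in points:
        # Find the nearest existing seed, if any.
        best, best_dist = None, math.inf
        for i, s in enumerate(seeds):
            d = math.dist(p, s)
            if d < best_dist:
                best, best_dist = i, d
        if best is not None and best_dist <= threshold:
            clusters[best].append(p)   # homogeneous with an existing group
        else:
            seeds.append(p)            # p initiates a new cluster
            clusters.append([p])
    return seeds, clusters


# Made-up sample data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.9), (8.0, 8.0), (0.8, 1.1), (8.3, 7.9)]
seeds, clusters = seed_clustering(points, threshold=2.0)
print(seeds)     # the two cluster seeds
print(clusters)  # members of each cluster, seed first
```

With these sample points, (1.0, 1.0) and (8.0, 8.0) become the seeds, and the remaining points join the nearer of the two. Note that the result of such a single-pass scheme depends on the order in which objects arrive and on the chosen threshold.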