International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395 -0056
Volume: 03 Issue: 01 | Jan-2016 www.irjet.net p-ISSN: 2395-0072
© 2016, IRJET | Impact Factor value: 4.45 | ISO 9001:2008 Certified Journal | Page 1088
Clustering of Big Data Using Different Data-Mining Techniques
Manisha R. Thakare¹, Prof. S. W. Mohod², Prof. A. N. Thakare³
¹ M. Tech, Computer Science & Engineering, B.D. College of Engineering, Wardha, Maharashtra, India
²,³ Assistant Professor, Computer Science & Engineering, B.D. College of Engineering, Wardha, Maharashtra, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - Large amounts of heterogeneous digital data
now exist. This paper examines the phenomenon of Big
data and the Big data analytics that has emerged
around it. Big data is large-volume, heterogeneous,
distributed data. In Big data applications, where data
collection grows continuously, it is expensive to
capture, manage, extract and process data using
existing software tools. Fast retrieval of relevant
information from databases has always been a
significant issue. Clustering is a main task of
exploratory data analysis and data mining
applications. It is an unsupervised data mining
technique that divides a dataset into groups.
Key Words: Data Mining, Clustering, Classification,
Clustering Algorithms, Big Data, Map-Reduce.
1. INTRODUCTION
Big data is one of the biggest buzz phrases in the IT domain.
New technologies of personal communication are driving the
big data trend, and the internet population grows day by day.
The need for big data arises from large companies such as
Facebook, Yahoo, Google and YouTube, which analyze enormous
amounts of data that must be converted from unstructured
into structured form. The need for Big data analytics on data
stored in relational database systems is described in terms
of five parameters: variety, volume, value, veracity and
velocity.
Volume: Data of all types is growing day by day, from KB, MB
and TB to PB, ZB and YB of information. The data results in
large files. Excessive data volume is mainly a storage issue,
which is addressed by reducing storage cost. Data volumes
are expected to grow 50 times by 2020.
Variety: Data sources are extremely heterogeneous. Files
come in various formats and types, and may be structured or
unstructured, such as text, audio, video, log files and
more.
Velocity: Data arrives at high speed. Sometimes even 1
minute is too late, so big data is time sensitive. For some
organizations, data velocity is the main challenge.
Value: Value is the main buzz for big data because it is
important for business and IT infrastructure systems to
store large amounts of valuable data in databases. It is
the most important V in big data.
Veracity: Veracity concerns the increase in the range of
values typical of a large data set. When dealing with high
volume, velocity and variety of data, not all of the data
will be 100% correct; there will be dirty data.
1.1 Data Mining Techniques
Data mining includes many types of techniques, such as
clustering, classification and neural networks, but in this
paper we consider only two techniques.
1.1.1 Clustering
Clustering is one of the most significant tasks of data
mining. It is an unsupervised machine learning method. In
clustering, objects are divided into groups according to
their similarity, since no class variable is given in
advance. Two important topics are: (1) different ways to
group a set of objects into a set of clusters; (2) types of
clusters. The result of cluster analysis is a number of
heterogeneous groups with homogeneous contents: the groups
differ from one another, while the members within each group
are similar. The first document or object of a cluster is
defined as the initiator of that cluster. The initiator is
called the cluster seed.
Fig1: Cluster Analysis
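The role of the cluster seed described above can be illustrated with a minimal single-pass sketch. The paper does not specify how later objects are assigned to a seed, so the rule used here (Euclidean distance to the seed within a fixed threshold) is an assumption, as are the function name seed_clustering, the threshold parameter and the sample points. It should be read as one possible illustration, not as the method used in this paper.

```python
import math


def seed_clustering(points, threshold):
    """Single-pass clustering in which the first object of each cluster is its seed.

    Assumption (not from the paper): a point joins the first cluster whose
    seed lies within `threshold` (Euclidean distance); otherwise it starts
    a new cluster and becomes that cluster's seed (initiator).
    """
    seeds = []     # seed (initiator) of each cluster
    clusters = []  # members of each cluster, seed first
    for p in points:
        for i, s in enumerate(seeds):
            if math.dist(p, s) <= threshold:
                clusters[i].append(p)
                break
        else:
            # No existing seed is close enough: p initiates a new cluster.
            seeds.append(p)
            clusters.append([p])
    return seeds, clusters


# Hypothetical sample data: two visibly separated groups of 2-D points.
points = [(0.0, 0.0), (0.5, 0.2), (5.0, 5.0), (5.1, 4.8), (0.1, 0.4)]
seeds, clusters = seed_clustering(points, threshold=1.0)
print(seeds)
print(clusters)
```

With these sample points, the objects near the origin share the seed (0.0, 0.0) and the objects near (5, 5) share the seed (5.0, 5.0). Each group is homogeneous internally, while the two groups differ from each other. Because the seeds depend on the order in which objects arrive, a different input order may produce different clusters.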