DOI: http://dx.doi.org/10.26483/ijarcs.v9i1.5090
Volume 9, No. 1, January-February 2018
International Journal of Advanced Research in Computer Science
RESEARCH PAPER
Available Online at www.ijarcs.info
© 2015-19, IJARCS All Rights Reserved
ISSN No. 0976-5697
GENERATION OF A HYBRID CLUSTERING ALGORITHM FOR BIG DATA
Deepak Ahlawat
PhD Research Scholar, MMU, Sadopur
Ambala, Haryana, India
Dr. Deepali Gupta
HOD CSE, MMU, Sadopur
Ambala, Haryana, India
Abstract: In this paper, a hybrid algorithm for clustering big data is proposed, based on rank similarity. The rank similarity of two data objects is calculated as the sum of their cosine and Gaussian similarities. The proposed technique is compared with an existing technique based on cosine similarity alone. The comparison uses the parameters precision, recall, F-measure, and accuracy. Results are evaluated in Java NetBeans 8.2.
Keywords: Cosine Similarity, Gaussian Similarity, Rank Similarity.
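The abstract defines the rank similarity of two objects as the sum of their cosine and Gaussian similarities. A minimal Java sketch of that score follows; the Gaussian (RBF) form exp(-||a-b||²/(2σ²)) and the bandwidth parameter sigma are assumptions, since the paper does not give the formulas explicitly.

```java
// Sketch of the rank-similarity score described in the abstract:
// rank(a, b) = cosine(a, b) + gaussian(a, b).
// The Gaussian form and its bandwidth sigma are assumed here; the
// paper does not specify them.
public class RankSimilarity {

    // Cosine similarity: dot(a, b) / (||a|| * ||b||).
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Gaussian (RBF) similarity: exp(-||a - b||^2 / (2 * sigma^2)).
    static double gaussian(double[] a, double[] b, double sigma) {
        double d2 = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            d2 += diff * diff;
        }
        return Math.exp(-d2 / (2 * sigma * sigma));
    }

    // Rank similarity: the sum of the two scores above.
    static double rank(double[] a, double[] b, double sigma) {
        return cosine(a, b) + gaussian(a, b, sigma);
    }

    public static void main(String[] args) {
        double[] x = {1, 0, 1};
        double[] y = {1, 1, 0};
        // cosine = 0.5, gaussian = exp(-1), so rank ≈ 0.8679
        System.out.printf("%.4f%n", rank(x, y, 1.0));
    }
}
```

Because the two component scores lie in different ranges (cosine in [-1, 1], Gaussian in (0, 1]), the combined rank score rewards pairs that are close under both measures.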
1. INTRODUCTION
1.1 Big Data
As stated by IBM, with pervasive handheld devices, machine-to-machine communication, and online/mobile social networks, 2.5 quintillion bytes of data have been created every day over the last two years. It has become difficult for users to capture, store, manage, analyse, share, and visualize this data with conventional processing tools. The concept of big data was proposed in response.
The capability for data generation has never been as enormous and powerful as it has become since the development of information technology (IT) in the late 19th century. As an example, on October 4, 2012, the first presidential debate between President Barack Obama and Governor Mitt Romney triggered more than 10 million tweets within two hours; the discussion peaked at specific moments and revealed the public's interest in topics such as Medicare and vouchers. However, the term 'big data' is still vague. As described on Wikipedia, big data refers to data sets so large and complex that traditional data-processing applications are inadequate for processing them. A widely accepted definition belongs to IDC: 'big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis.' The widespread use of such large and exceptionally valuable data also increases the risks to security and privacy. For example, Amazon monitors users' shopping preferences; Facebook collects personal information as well as our social relationships; and mobile operators know not only to whom a person is talking but also that person's availability. The promising value accrues to those who can analyse the data, and the signs point to a further surge in the storage, re-use, and gathering of personal data by third parties. If the age of the Internet threatened security and privacy, the era of big data will endanger them further. Before examining what big data is in detail, consider the diagram below from Hewlett-Packard:
Fig. 1. Amount of Data Volume
1.2 Clustering
Grouping data into different sets, classes, or clusters is known as clustering. The data placed in one cluster are similar to the other data in that cluster and dissimilar to the data in other clusters. Dissimilarity can be calculated with respect to various attributes. There are various distance measures that describe the dissimilarity between data objects; these measures are then used to construct a dissimilarity matrix. Clustering is useful in various fields such as data mining, statistics, biology, and machine learning. Numerous clustering algorithms are discussed in the literature. Every algorithm has its own pros and cons, and each finds use in different situations [1].
Typically, clustering algorithms fall into the following categories:
1. Partitioning Methods.
2. Density-Based Methods.
3. Hierarchical Methods.
4. Grid-Based Methods.
5. Supervised and Unsupervised Learning Based
Methods.
In this paper, two important categories of methods (partitioning and density-based) are exploited to perform the clustering.
Partitioning Methods: Suppose there is a database containing n objects or data tuples, and the task is to divide these objects into k clusters, where k ≤ n. A partitioning method performs this clustering according to the dissimilarity between data objects: objects that are similar are placed in the same group, and objects that are dissimilar are placed in different groups. A clustering algorithm should meet some essential requirements: (1) no cluster may be empty, i.e., every cluster should contain at least one data object, and (2) no data object is shared among