International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395 -0056
Volume: 04 Issue: 04 | Apr -2017 www.irjet.net p-ISSN: 2395-0072
© 2017, IRJET | Impact Factor value: 5.181 | ISO 9001:2008 Certified Journal | Page 2363
K-Means algorithm with different distance metrics in spatial data
mining with uses of NetBeans IDE 8.2
Ms. Kothariya Arzoo¹, Asst. Prof. Kirit Rathod²
¹ M. Tech student, Computer Engineering, C.U. Shah College of Engineering and Technology, Gujarat, India
² Asst. Prof., Computer Engineering, C.U. Shah College of Engineering and Technology, Gujarat, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - Data mining is the process of finding useful
information in large databases. Clustering is the process of
grouping elements with similar characteristics into one cluster
while placing elements with distinct characteristics into
different clusters. K-Means is a simple and popular
clustering technique. Spatial data mining is an application
of data mining in which a spatial or geographic
dataset is used. Distance metrics play an important role in
clustering. In this paper we carry out experiments
in NetBeans IDE 8.2 on spatial data taken from an Indian
government website. The paper presents an implementation
analysis of K-Means clustering with different distance metrics
on a spatial database.
Key Words: Clustering, Spatial Data mining, K-Means,
Distance Metrics
1. INTRODUCTION
Data mining [1] [2] is the process of finding useful
information. It consists of the following steps: a) data
cleaning, b) data selection, c) data integration, d) data
mining, e) pattern evaluation and finally f) knowledge
representation [2]. Nowadays a large amount of data is
generated every day, but not all of this data is useful to us.
Data mining is therefore essential in daily life. Data
clustering [3] is the process of placing similar data
in one cluster and dissimilar data in different clusters, so that
the separation between clusters is high and the variation within
each cluster is low. These are the two defining properties of a
clustering. Clustering is a popular research topic nowadays.
Clustering algorithms are classified as partitioning-based clustering,
hierarchical clustering, grid-based clustering, etc. [4]. A
partitional clustering is simply a division of the set of data
objects into non-overlapping subsets such that each data
object is in exactly one subset. A hierarchical clustering is a
set of nested clusters that are structured as a tree. Clustering
process involves feature selection, the clustering algorithm,
cluster validation, result interpretation and knowledge [5].
K-means [6] [7] is a very simple clustering algorithm in data
mining. Clustering is an unsupervised technique [3] [4]: no
labeled training or test data is available
from which a result could be predicted. Many
research and review papers are available on K-means and its
variants [6]. K-means has been modified with respect to several
parameters, such as centroid initialization, distance
metrics and accuracy. The K-means algorithm has several
limitations [8], such as handling empty clusters, outlier
detection, the choice of distance metric, the need to predefine
the number of clusters and the random choice of
initial cluster centers. Spatial data mining [9]
[10] [11] requires specific resources to obtain a spatial
database in a specific format. Applications of
spatial data mining include geomarketing, environmental
studies, risk analysis, and so on. The aim of clustering is to
automatically find groups of instances that are similar to
each other. For example, in a classroom, the students whose
birthdays fall in the same month form a cluster.
Using a geographic information system (GIS), the user can query
spatial data and perform simple analytical tasks using programs or queries.
However, GIS are not designed to perform complex data
analysis or knowledge discovery [12]. Many
distance metrics [13] [14] [15] are available, such as Euclidean,
Manhattan (city block), Chebyshev and Minkowski.
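The four distance metrics named above can be illustrated with a short sketch. This is only an illustration in Python under simple assumptions, not the implementation used in the experiments: points are plain tuples of numbers, and the function names (`euclidean`, `manhattan`, `chebyshev`, `minkowski`) are chosen here for readability.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of the absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    """Largest absolute difference along any single coordinate."""
    return max(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    """Minkowski distance of order p (p >= 1).

    p = 1 gives the Manhattan distance and p = 2 the Euclidean distance.
    """
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)
```

For the points (0, 0) and (3, 4), the Euclidean distance is 5, the Manhattan distance is 7 and the Chebyshev distance is 4, which shows how strongly the choice of metric can change the notion of "nearest".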
1.1 K-Means Algorithm
The K-means [6] [7] algorithm starts by randomly selecting
K initial centroids (means), where K is a user-defined
number of desired clusters. For each object, the
distance to every center point is calculated, and the object is
assigned to the cluster whose center is nearest. The
data points of one cluster are thus far from the other clusters. The
computation stops when the center points no longer
move. The K-means algorithm works as follows:
1. Initialize by setting the initial centroids for a
predefined k.
2. Group the data points into the given k clusters.
3. Assign each object to the nearest cluster center
according to the distance function.
4. When all objects are assigned, recalculate (update)
the positions of the k centroids.
5. Repeat steps 3 and 4 until the centroids no longer
move.
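The steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the NetBeans implementation evaluated in this paper: the function name `k_means`, the `seed` and `max_iter` parameters, the choice of initial centroids by sampling k data points, and the rule that an empty cluster keeps its previous centroid are all assumptions made for the illustration. The distance function is passed in as a parameter so that different metrics can be swapped in.

```python
import random

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def k_means(points, k, distance=euclidean, max_iter=100, seed=0):
    """Return (centroids, assignment) for a list of equal-length tuples."""
    rng = random.Random(seed)
    # Step 1: pick k data points at random as the initial centroids (assumption).
    centroids = rng.sample(points, k)
    assignment = []
    for _ in range(max_iter):  # max_iter is a safeguard, not part of the steps
        # Step 3: assign every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: distance(p, centroids[c]))
            for p in points
        ]
        # Step 4: recompute each centroid as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Step 5: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignment
```

Calling `k_means(points, 2, distance=manhattan)` runs the same loop with the city-block metric, so only the distance function changes while the assignment and update steps stay the same.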
A diagrammatic representation of the K-means algorithm is as
follows [17]: