I.J. Intelligent Systems and Applications, 2013, 03, 37-49
Published Online February 2013 in MECS (http://www.mecs-press.org/)
DOI: 10.5815/ijisa.2013.03.04
Copyright © 2013 MECS
Efficient Data Clustering Algorithms:
Improvements over Kmeans
Mohamed Abubaker, Wesam Ashour
Dept. of Computer Engineering, Islamic University of Gaza, Gaza, Palestine
mabubaker@hotmail.com;washour@iugaza.edu.ps
Abstract— This paper presents a new approach to overcoming two of the best-known disadvantages of the classical Kmeans clustering algorithm: the random initialization of prototypes and the requirement that the number of clusters be specified in advance. Randomly initialized prototypes often cause Kmeans to converge to a local rather than a global optimum, so obtaining a satisfactory result may require running the algorithm many times. The proposed algorithms are based on a novel definition of the density of a data point, derived from the k-nearest neighbor method. Using this definition we detect the noise and outliers that strongly affect Kmeans, and we obtain good initial prototypes from a single run, with automatic determination of the number of clusters K. This algorithm is referred to as Efficient Initialization of Kmeans (EI-Kmeans). Even so, Kmeans remains limited to clusters with convex shapes and similar sizes and densities. We therefore develop a new clustering algorithm, called Efficient Data Clustering Algorithm (EDCA), that uses the same definition of data point density. The results show that the proposed algorithms improve data clustering over Kmeans, and that EDCA is able to detect clusters with different non-convex shapes, sizes, and densities.
Index Terms— Data Clustering, Random Initialization,
Kmeans, K-Nearest Neighbor, Density, Noise, Outlier,
Data Point
I. Introduction
Data clustering techniques play an important role in many fields such as data mining [1], pattern recognition and pattern classification [2], data compression [3], machine learning [4], image analysis [5], and bioinformatics [6].
The purpose of clustering is to group data points into clusters such that similar data points are placed in the same cluster while dissimilar data points are placed in different clusters. High-quality clustering achieves high intra-cluster similarity and low inter-cluster similarity.
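As a toy illustration (the data here are assumed for this sketch, not taken from the paper), two tight, well-separated groups exhibit exactly this property: pairwise distances within each group are small (high intra-cluster similarity), while distances between the groups are large (low inter-cluster similarity):

```python
import numpy as np

# Two hypothetical clusters: tight groups placed far apart.
A = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
B = np.array([[10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

def mean_pairwise(P, Q):
    """Mean Euclidean distance over all distinct point pairs from P and Q."""
    return np.mean([np.linalg.norm(p - q)
                    for p in P for q in Q
                    if not np.array_equal(p, q)])

# Within-group distances are small; between-group distances are large.
intra = (mean_pairwise(A, A) + mean_pairwise(B, B)) / 2
inter = mean_pairwise(A, B)
```

A good clustering of these six points would yield `intra` far below `inter`; a bad one would mix the groups and blur that gap.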
Clustering problems can be categorized into two main types: fuzzy clustering and hard clustering. In fuzzy clustering, a data point can belong to more than one cluster, with membership degrees [7] that indicate the strength of the relationship between the data point and a particular cluster. One of the most widely used fuzzy clustering algorithms is the fuzzy c-means algorithm [8]. In hard clustering, data points are divided into distinct clusters, where each data point belongs to one and only one cluster. Hard clustering is subdivided into hierarchical and partitional algorithms. Hierarchical algorithms create nested relationships among clusters, which can be represented as a tree structure called a dendrogram [9]. These algorithms can be divided into agglomerative and divisive hierarchical algorithms. Agglomerative hierarchical clustering starts with each data point in its own cluster and repeatedly merges the most similar pair of clusters until all data points are in one cluster; examples include complete linkage clustering [10] and single linkage clustering [11]. Divisive hierarchical clustering reverses the agglomerative process: it starts with all data points in one cluster and repeatedly splits large clusters into smaller ones until each data point forms its own cluster, as in the DIANA clustering algorithm [12].
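The agglomerative process described above can be sketched as follows. This is a minimal single-linkage illustration written for this discussion, not code from any cited work; it stops merging once a target number of clusters remains rather than continuing down to a single cluster:

```python
import numpy as np

def single_linkage(X, target_k):
    """Agglomerative clustering sketch: start with each point as its own
    cluster and repeatedly merge the closest pair of clusters, where
    cluster-to-cluster distance is the minimum point-to-point distance
    (single linkage). Stops when target_k clusters remain."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > target_k:
        best = None
        # Find the pair of clusters with the smallest single-link distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the closest pair into one cluster.
        clusters[a] += clusters.pop(b)
    return clusters
```

Running this on four points forming two well-separated pairs merges each pair first, which is the nested behavior a dendrogram records level by level.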
Partitional clustering algorithms divide the data set into a set of disjoint clusters; examples include Kmeans [13], PAM [12], and CLARA [12].
One of the most well-known unsupervised learning algorithms for clustering datasets is the Kmeans algorithm [12]. Kmeans is the most widely used clustering algorithm [14] due to its simplicity and efficiency in various fields, and it is ranked among the top ten algorithms in data mining [15]. The Kmeans algorithm works as follows:
1. Select a set of k initial prototypes (means) from the data set, where k is a user-defined parameter that represents the number of clusters in the data set.
2. Assign each data point in the data set to its nearest prototype.
3. Update each prototype to the average of the data points assigned to it.
4. Repeat steps 2 and 3 until convergence.
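The four steps above can be sketched as follows. This is an illustrative NumPy implementation, not the authors' code; note that step 1 picks prototypes at random, which is precisely the weakness the paper sets out to address:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Classical Kmeans on an (n, d) data array X."""
    rng = np.random.default_rng(seed)
    # Step 1: select k initial prototypes at random from the data set.
    prototypes = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Step 2: assign each data point to its nearest prototype.
        dists = np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: update each prototype to the mean of its assigned points
        # (an empty cluster keeps its previous prototype).
        new_prototypes = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else prototypes[j]
            for j in range(k)
        ])
        # Step 4: repeat until convergence (prototypes stop moving).
        if np.allclose(new_prototypes, prototypes):
            break
        prototypes = new_prototypes
    return prototypes, labels
```

Because the initialization is random, different seeds can converge to different local optima, which is why Kmeans is commonly restarted many times in practice.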