I.J. Intelligent Systems and Applications, 2013, 03, 37-49 Published Online February 2013 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijisa.2013.03.04 Efficient Data Clustering Algorithms: Improvements over Kmeans Mohamed Abubaker, Wesam Ashour Dept. of Computer Engineering, Islamic University of Gaza, Gaza, Palestine mabubaker@hotmail.com; washour@iugaza.edu.ps Abstract—This paper presents a new approach to overcoming some of the best-known disadvantages of the well-known Kmeans clustering algorithm. Classical Kmeans suffers from problems such as the random initialization of prototypes and the requirement that the number of clusters be specified in advance. Randomly initialized prototypes often cause convergence to a local rather than a global optimum, so a satisfactory result may require running Kmeans many times. The proposed algorithms are based on a novel definition of the densities of data points, built on the k-nearest neighbor method. Using this definition, we detect the noise and outliers that strongly affect Kmeans, and obtain good initial prototypes in a single run with automatic determination of the number of clusters K. This algorithm is referred to as Efficient Initialization of Kmeans (EI-Kmeans). However, Kmeans is still limited to clustering data with convex shapes and similar sizes and densities. We therefore develop a new clustering algorithm, called the Efficient Data Clustering Algorithm (EDCA), that uses our new definition of the densities of data points. The results show that the proposed algorithms improve data clustering over Kmeans. EDCA is able to detect clusters with different non-convex shapes and different sizes and densities. Index Terms—Data Clustering, Random Initialization, Kmeans, K-Nearest Neighbor, Density, Noise, Outlier, Data Point I.
Introduction Data clustering techniques are an important tool used in many fields such as data mining [1], pattern recognition and pattern classification [2], data compression [3], machine learning [4], image analysis [5], and bioinformatics [6]. The purpose of clustering is to group data points into clusters such that similar data points are placed in the same cluster while dissimilar data points fall in different clusters. High-quality clustering achieves high intra-cluster similarity and low inter-cluster similarity. Clustering problems can be categorized into two main types: fuzzy clustering and hard clustering. In fuzzy clustering, data points can belong to more than one cluster, with probabilities [7] that indicate the strength of the relationship between a data point and a particular cluster. One of the most widely used fuzzy clustering algorithms is the fuzzy c-means algorithm [8]. In hard clustering, data points are divided into distinct clusters, where each data point belongs to one and only one cluster. Hard clustering is subdivided into hierarchical and partitional algorithms. Hierarchical algorithms create nested relationships of clusters that can be represented as a tree structure called a dendrogram [9]. These algorithms can be divided into agglomerative and divisive hierarchical algorithms. Agglomerative hierarchical clustering starts with each data point in its own cluster, then repeatedly merges the most similar pairs of clusters until all of the data points are in one cluster; examples include complete linkage clustering [10] and single linkage clustering [11]. Divisive hierarchical clustering reverses the operations of agglomerative clustering: it starts with all data points in one cluster and repeatedly splits large clusters into smaller ones until each data point belongs to its own cluster, as in the DIANA clustering algorithm [12].
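The agglomerative merge loop described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper; the single-linkage merge criterion, the 2-D tuple representation, and the stopping condition (merge until a target number of clusters remains) are our own assumptions for the sketch.

```python
def single_linkage(points, target_k):
    """Agglomerative clustering sketch: start with every point in its own
    cluster, then repeatedly merge the two closest clusters until only
    target_k clusters remain.  Single linkage: the distance between two
    clusters is the minimum squared distance over all cross-cluster pairs."""
    clusters = [[p] for p in points]  # each data point starts as its own cluster

    def link_dist2(ci, cj):
        # smallest squared Euclidean distance between any a in ci and b in cj
        return min(sum((x - y) ** 2 for x, y in zip(a, b))
                   for a in ci for b in cj)

    while len(clusters) > target_k:
        # find the pair of clusters with the smallest single-link distance
        i, j = min(((i, j)
                    for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: link_dist2(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))  # merge cluster j into cluster i
    return clusters
```

Running the loop to completion (target_k = 1) would trace out the full dendrogram; stopping early simply cuts that tree at a chosen level.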
Partitional clustering algorithms divide the data set into a set of disjoint clusters; examples include Kmeans [13], PAM [12], and CLARA [12]. One of the best-known unsupervised learning algorithms for clustering data sets is the Kmeans algorithm [12]. Kmeans is the most widely used [14] due to its simplicity and efficiency in various fields, and it is considered one of the top ten algorithms in data mining [15]. The Kmeans algorithm works as follows:
1. Select a set of k initial prototypes (means) from the data set, where k is a user-defined parameter representing the number of clusters in the data set.
2. Assign each data point in the data set to its nearest prototype m.
3. Update each prototype to the average of the data points assigned to it.
4. Repeat steps 2 and 3 until convergence.
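The four steps above can be sketched directly in Python. This is a minimal illustration of standard Kmeans (Lloyd's algorithm), not the paper's proposed method; the function names, the 2-D tuple representation, and the use of random sampling for step 1 are our own assumptions.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100, seed=0):
    """Plain Kmeans following the four steps in the text."""
    rng = random.Random(seed)
    prototypes = rng.sample(points, k)   # step 1: random initial prototypes
    assignment = None
    for _ in range(max_iter):
        # step 2: assign each data point to its nearest prototype
        new_assignment = [min(range(k), key=lambda j: dist2(p, prototypes[j]))
                          for p in points]
        if new_assignment == assignment:  # step 4: stop when assignments stabilize
            break
        assignment = new_assignment
        # step 3: move each prototype to the mean of its assigned points
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # guard against a prototype losing all its points
                prototypes[j] = tuple(sum(c) / len(members)
                                      for c in zip(*members))
    return prototypes, assignment
```

Note how step 1 is the weak point the paper targets: a different `seed` can land the initial prototypes in the same region and drive the loop toward a poor local optimum.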