International Journal of Computer Applications (0975 – 8887), Volume 57, No. 7, November 2012

Comparative Study of Fuzzy k-Nearest Neighbor and Fuzzy C-means Algorithms

Pradeep Kumar Jena
National Institute of Science and Technology, Berhampur, Odisha, India

Subhagata Chattopadhyay
Bankura Unnayani Institute of Engineering, Bankura-722146, West Bengal, India

ABSTRACT
Fuzzy clustering techniques handle the fuzzy relationships among data points and between data points and cluster centers (termed "cluster fuzziness"). Distance measures, in turn, are used to compute the load of such fuzziness. These are the two important parameters governing cluster quality and run time. Visualizing multidimensional data clusters in lower dimensions is another important research area for revealing the hidden patterns within the clusters. This paper investigates the effects of cluster fuzziness and three distance measures, namely Manhattan distance (MH), Euclidean distance (ED), and Cosine distance (COS), on the Fuzzy c-means (FCM) and Fuzzy k-nearest neighbor (FkNN) clustering techniques, implemented on the Iris and extended Wine data. Cluster quality is assessed based on (i) the data discrepancy factor (DDF, proposed in this study), (ii) cluster size, (iii) compactness, (iv) distinctiveness, (v) execution time, and (vi) the cluster fuzziness (m) value. The study observes that FCM handles cluster fuzziness better than FkNN, and that the MH distance measure yields the best clusters with both FCM and FkNN. Finally, the best clusters are visualized using a Self-Organizing Map (SOM).

General terms: Fuzzy clustering algorithms, comparisons, datasets, distance measures

Keywords: Fuzzy clusters; FkNN; FCM; Cluster fuzziness; Data discrepancy factor (DDF)

1 INTRODUCTION
Clusters are defined as groups of similar data points without any predefined class labels.
Clustering is the process of partitioning a dataset into groups such that each group is different from the others. It is an unsupervised process, as the algorithms learn from observations rather than from labeled examples, as in classification. Clustering is therefore useful to (i) explore the hidden pattern of a given dataset and (ii) model the data. The popularity of clustering techniques in machine learning stems from their inherent ability to handle (i) different types of attributes in multidimensional data, (ii) noisy data, and (iii) users having no domain knowledge.

Clusters are of two types: crisp and fuzzy. In crisp clusters, the cluster boundaries are well-defined, and within a boundary each data point is grouped according to its crisp similarity to the representative data point, or cluster center. Popular crisp clustering techniques include K-means [1], K-medoid [2], and agglomerative and divisive clustering [3]. In fuzzy clusters, on the other hand, the cluster boundary is ill-defined, as the data points are assigned to clusters according to their degrees of belongingness (i.e., fuzzy memberships). Hence, fuzzy clustering is popular for partitioning real-world data, where data-data relationships are usually subjective and non-linear in nature [4]. Several fuzzy clustering techniques are available, e.g., Fuzzy c-means (FCM) [5], Fuzzy k-nearest neighbor (FkNN) [6], Entropy-based fuzzy clustering (EFC) [7], and Fuzzy ISODATA [8]. This paper, however, focuses on the FCM and FkNN techniques.

In both crisp and fuzzy clustering, cluster centers play the key role in grouping the data points, because they are the most representative data of their respective clusters. Cluster centers also provide information about the pattern stored within a cluster, and they are useful for measuring the compactness and distinctiveness of the clusters.
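To make the fuzzy-membership idea concrete, the standard Bezdek-style FCM iteration can be sketched with a pluggable distance, covering the three measures compared in this study (MH, ED, COS). This is an illustrative sketch under stated assumptions, not the authors' implementation; the function names and parameters (fcm, fuzzifier m, tolerance tol) are chosen here for exposition.

```python
import numpy as np

# Illustrative sketch of a standard Bezdek-style FCM, not the authors' exact
# implementation. The fuzzifier m is the "cluster fuzziness" parameter.

def manhattan(a, b):
    # MH: absolute differences summed attribute-by-attribute
    return np.abs(a - b).sum(axis=-1)

def euclidean(a, b):
    # ED: square root of summed squared differences
    return np.sqrt(((a - b) ** 2).sum(axis=-1))

def cosine(a, b):
    # COS: 1 - cosine similarity (assumes non-zero vectors)
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return 1.0 - num / den

def fcm(X, c, m=2.0, dist=euclidean, max_iter=100, tol=1e-5, seed=0):
    """Cluster rows of X into c fuzzy clusters; returns (centers, memberships)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)  # each row of U sums to 1
    for _ in range(max_iter):
        Um = U ** m
        # Center update: fuzzy-weighted mean of the data points.
        # (Exact for ED; used as an approximation for MH/COS in this sketch.)
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Distance from every point to every center, clipped to avoid /0
        d = np.fmax(dist(X[:, None, :], centers[None, :, :]), 1e-10)
        # Membership update: u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1))
        U_new = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))).sum(axis=2)
        if np.abs(U_new - U).max() < tol:
            return centers, U_new
        U = U_new
    return centers, U
```

A larger m flattens the memberships toward 1/c (fuzzier clusters), while m close to 1 approaches crisp K-means-like assignments, which is why the paper treats m as a quality-governing parameter.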
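For comparison, the membership computation at the heart of the classical fuzzy k-nearest neighbor rule (in the style of Keller et al.) can be sketched as follows. This is a minimal sketch of the generic rule given fuzzily labeled reference points, not the authors' clustering variant; the names (fknn_memberships, X_ref, U_ref) are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the classical fuzzy k-NN membership rule, assuming a set
# of reference points with known fuzzy memberships; not the authors' variant.

def fknn_memberships(X_ref, U_ref, x, k=3, m=2.0):
    """Fuzzy membership of query point x in each cluster/class.

    X_ref : (n, f) reference points
    U_ref : (n, c) fuzzy memberships of the reference points (rows sum to 1)
    """
    d = np.linalg.norm(X_ref - x, axis=1)
    idx = np.argsort(d)[:k]                                  # k nearest neighbors
    w = 1.0 / np.fmax(d[idx], 1e-10) ** (2.0 / (m - 1.0))    # inverse-distance weights
    # Weighted average of the neighbors' memberships, normalized by the weights
    return (w[:, None] * U_ref[idx]).sum(axis=0) / w.sum()
```

Unlike FCM, which iteratively re-estimates centers and memberships globally, this rule assigns memberships locally from the k nearest neighbors, which is one reason the two techniques respond differently to the fuzziness parameter m.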
Compactness denotes how closely the data points are located around their cluster centers; distinctiveness measures how far the clusters lie from each other. A good clustering algorithm must produce compact and distinct clusters.

Measuring the similarity between a representative cluster center and the data points to be clustered is the basic technique for iteratively gathering similar data points into a cluster and excluding dissimilar ones. Distance measures are the most common way to compute such dissimilarity. Several distance measures exist, such as Euclidean (ED), Manhattan (MH), Cosine (COS), Mahalanobis, and Hamming [9]. Generally, the distance between two multidimensional data points is calculated attribute-by-attribute [10]. A detailed discussion of all available techniques is beyond the scope of this work. The second focus of this study is to investigate how three distance measures, ED, MH, and COS, influence the overall clustering performance.

Cluster visualization is an important method for directly displaying clusters in order to interpret their size, shape, compactness, distinctiveness, etc. It is challenging for clusters of multidimensional data points. The third focus of this study is to showcase the best clusters, obtained through the FCM and FkNN techniques using the ED, MH, and COS distances, on a Self-Organizing Map (SOM) [11].

2 RELATED RESEARCH
Fuzzy clustering techniques are quite popular in various research domains, such as engineering, economics and commerce, biometry and imaging, and the medical sciences. This paper focuses on the FCM and FkNN techniques as applied in various domains. Some recent studies on these two techniques are described below.

2.1 Works related to FkNN
FkNN has been successfully used in various domains of science, such as materials science [12], banking and finance [13] [14],