International Journal of Computer Applications (0975 – 8887)
Volume 78 – No.5, September 2013

A Comparative Study on K Means and PAM Algorithm using Physical Characters of Different Varieties of Mango in India

Bhaskar Mondal
Department of Computer Science and Engineering
National Institute of Technology Jamshedpur
Jamshedpur, India- 831014

J. Paul Choudhury, Ph.D
Department of Information Technology
Kalyani Government Engineering College
Kalyani, West Bengal, India- 741235

ABSTRACT
Clustering is one of the most important and popular techniques for finding patterns and relationships in databases. In this paper, a comparative study is carried out on the clustering techniques k-means and k-medoid (PAM) with different distance measures to classify the different varieties of mango based on the physical characters of the fruit. As the purity of the result of a clustering algorithm depends on the distance measure used in that algorithm, we have also validated the results using different distance measures. Classification of agricultural data remains a challenge due to its high dimensionality and noise. This type of study may be helpful for agricultural research as well as for the field of science and technology.

General Terms
Clustering.

Keywords
Clustering, k-means, k-medoid, PAM, distance.

1. INTRODUCTION
Clustering techniques partition a collection of data objects into k subsets, or "clusters", so that objects in the same cluster are more closely related to one another than to objects assigned to different clusters. Grouping is done on the basis of similarities or dissimilarities (distances) between objects [2]. The number of groups (k) may be user defined. Clustering is an unsupervised technique, as no pre-classified data is provided as a training set. Clustering can be used to discover interesting patterns in the data, or to verify the purity of predefined classes.
There are various examples of both of these applications in the microarray literature [1][10].

It is important to understand the difference between clustering and classification. Classification techniques are supervised: a collection of pre-classified data objects must be provided, and the problem is to label newly encountered data records. Typically, the given labeled (training) patterns are used to learn descriptions of the classes, which in turn are used to label a new pattern.

Clustering methods may be roughly divided into two types, namely partitioning and hierarchical methods. In partitioning methods, the classes are mutually exclusive, with each object belonging to exactly one cluster. Each cluster may be represented by a centroid or a cluster representative. k-means [4], in which each cluster is represented by the center of the cluster, and k-medoids or PAM (Partitioning Around Medoids) [3], in which each cluster is represented by one of the objects in the cluster, are two partitioning methods.

Hierarchical clustering methods, on the other hand, are the most commonly used. There are two types of hierarchical methods: agglomerative and divisive. An agglomerative hierarchical clustering is constructed by repeatedly finding the pair of closest points and merging them into one cluster, where a point is either an individual object or a cluster of objects, until only one cluster remains. The hierarchy is built up in a series of N-1 agglomerations. Divisive methods start with all objects in a single cluster and, at each of N-1 steps, divide one cluster into two smaller clusters, until each object resides in its own cluster. The divisive method is the less popular of the two.

Different types of mango are harvested within a year. One common question is how mangoes of all varieties are categorized by size. The size of a mango depends on different parameters, such as fruit weight, length, breadth, width, stone weight, peel weight, and percentage of pulp [13][14].
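
The difference between the two cluster representatives described above, the center used by k-means and the member object used by k-medoids (PAM), can be sketched in a few lines of Python. This is only an illustrative sketch: the (length, breadth) values are hypothetical and are not taken from the mango data set, Euclidean distance is assumed, and the helper names `centroid` and `medoid` are introduced here for illustration.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def centroid(points):
    """Coordinate-wise mean of the points: the k-means representative."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def medoid(points):
    """Member with the smallest total distance to all members:
    the k-medoids (PAM) representative."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))


# Hypothetical (length, breadth) measurements in cm; the last point is an outlier.
cluster = [(9.0, 6.0), (10.0, 7.0), (11.0, 8.0), (10.0, 6.0), (30.0, 20.0)]

c = centroid(cluster)
m = medoid(cluster)
print("centroid:", c, "| is a member of the cluster:", c in cluster)
print("medoid:  ", m, "| is a member of the cluster:", m in cluster)
```

With these assumed values the centroid is (14.0, 9.4), which is not one of the measured objects and is pulled toward the outlier, while the medoid remains the actual object (10.0, 7.0).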
Here, the comparative study of clustering methods is based on the physical characters of the fruits of different varieties of mango available in Gangetic West Bengal. The comparison is made with respect to accuracy and the ability to handle high-dimensional data. Section II describes the preliminaries: the different distance measures and the k-means and k-medoid clustering algorithms. The details of the proposed scheme are described in Section III. Sections IV and V present the experimental results and security analysis. The conclusion and future scope of the proposed scheme are presented in Section VI.

2. PRELIMINARIES
In this paper, a comparative study is carried out on the clustering algorithms k-means and k-medoid (PAM) with different distance measures to classify the agricultural (mango) data set. For measuring distance, the Euclidean distance, city block (Manhattan) distance, Chebyshev distance, Minkowski distance of order 2 and 3, and Bray-Curtis (Sorensen) distance are used. The distance measures are presented first, as distance calculation is the most important step of a clustering algorithm.

2.1. Euclidean distance:
Euclidean distance (Minkowski distance of order 2) gives the distance between two points in Cartesian coordinates. It