International Journal of Engineering and Advanced Technology (IJEAT) ISSN: 2249 – 8958, Volume-3, Issue-2, December 2013

Review of Clustering Algorithm for Categorical Data

Poonam M. Bhagat, Prasad S. Halgaonkar, Vijay M. Wadhai

Abstract: Clustering partitions data into groups, called clusters, such that data points in the same cluster are similar to one another and dissimilar to points in other clusters. It is a form of unsupervised learning, with no predefined class labels for the data points. Clustering is considered an important tool for data mining and has many applications, such as pattern recognition, image processing, market analysis and the World Wide Web. Categorical data consist of values that each represent some category. The problem of clustering categorical data has been addressed with the cluster ensemble approach, but this technique generates the final data partition from imperfect information. The ensemble-information matrix, that is, the binary cluster-association matrix, captures only cluster-to-data-point relations and leaves many entries unknown, which degrades the quality of the final data partition. To avoid this degradation, a link-based approach is presented that uses a refined cluster-association matrix. It maintains cluster-to-cluster relations and improves the quality of the final data partition by estimating the unknown entries from the similarity between clusters in the ensemble. A cluster ensemble combines multiple data partitions from different clustering algorithms into a single clustering solution in order to improve the robustness, accuracy and quality of the clustering result.

Index Terms- Clustering, categorical, link-based, ensemble

I. INTRODUCTION

Clustering is the division of data into groups of data points such that similar data points are placed in the same group, called a cluster, and dissimilar data points are placed in different clusters.
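To make the binary cluster-association matrix mentioned in the abstract more concrete, the following is a minimal sketch, not the implementation of any particular ensemble method. It assumes that each base clustering is given as a list of cluster labels, one per data point; the function name `binary_association_matrix` and the example labels are hypothetical and chosen only for illustration.

```python
from typing import Hashable, List, Sequence, Tuple


def binary_association_matrix(
    partitions: Sequence[Sequence[Hashable]],
) -> Tuple[List[List[int]], List[Tuple[int, Hashable]]]:
    """Build a binary cluster-association matrix from base clusterings.

    partitions[m][i] is the (hypothetical) cluster label that base
    clustering m assigns to data point i.
    """
    if not partitions:
        raise ValueError("at least one base clustering is required")
    n_points = len(partitions[0])
    if any(len(labels) != n_points for labels in partitions):
        raise ValueError("all base clusterings must label the same data points")

    # One column per (clustering index, cluster label), in order of first appearance.
    columns = [
        (m, label)
        for m, labels in enumerate(partitions)
        for label in dict.fromkeys(labels)
    ]
    # An entry is 1 when clustering m puts point i in that cluster, otherwise 0.
    matrix = [
        [1 if partitions[m][i] == label else 0 for m, label in columns]
        for i in range(n_points)
    ]
    return matrix, columns


# Two made-up base clusterings of five data points.
base_clusterings = [
    ["a", "a", "b", "b", "b"],
    ["x", "y", "y", "z", "z"],
]
matrix, columns = binary_association_matrix(base_clusterings)
for row in matrix:
    print(row)
```

In this sketch every row contains exactly one 1 per base clustering, so most entries are 0. Under the reading assumed here, those 0 entries are the "unknown" associations that the link-based approach tries to estimate from cluster-to-cluster similarity.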
Fig. 1 shows an example of clustering in which three clusters can be identified in the data. Here the similarity criterion is distance: two or more objects or data points belong to the same cluster if they are close according to a given distance measure. This is called distance-based clustering. When choosing a clustering algorithm, some important requirements should be considered:
• Robustness
• Scalability
• Capability to deal with different types of attributes
• Handling of outliers and noise
• High dimensionality
• Usability and compatibility

Categorical data is a collection of categories in which each value represents some category. It is also called qualitative data, and its values are typically unordered. Categorical data are further classified into two types, nominal and ordinal. Nominal data consist of unordered categories, such as marital status or hair color. Ordinal data consist of categories in which order is essential, such as exam rank.

Manuscript received December, 2013.
Poonam M. Bhagat, Department of Computer Engg., MITCOE-Pune University, India.
Prasad S. Halgaonkar, Department of Computer Engg., MITCOE-Pune University, India.
Vijay M. Wadhai, Principal, MITCOE-Pune University, India.

The notion of a cluster varies between algorithms, and it is one of the many decisions to make when choosing an appropriate algorithm for a particular problem. The clusters found by different algorithms vary considerably in their properties, and understanding these properties helps to understand the differences between the algorithms. One common use of clustering is data compression in image processing. Clustering categorical data becomes a very difficult task as the number of items or attributes increases.

Fig. 1 Clustering

Some issues with existing clustering techniques are as follows:
• Current clustering techniques do not address all of the requirements effectively.
• Dealing with a large number of dimensions and a large number of data items can be difficult because of time complexity.
• The effectiveness of distance-based clustering depends mainly on the definition of distance.
• The final result of a clustering algorithm can be interpreted in different ways.

Various algorithms have been introduced for clustering categorical data, but no single algorithm achieves the best result for all data sets. Different clustering algorithms for categorical data approach the problem from different perspectives: some are based on the idea of co-occurrences between attribute and value pairs that define a cluster, while subspace algorithms locate clusters in different subspaces of the data set. It is a difficult task to cluster a large amount of data and find a suitable partition in an unsupervised setting, that is, without any prior knowledge, while trying to maximize the similarity of objects belonging to the same cluster and minimize the similarity among objects in different clusters. Each algorithm has its own advantages and disadvantages. When run on a specific data set, different algorithms, or the same algorithm with different parameters, produce different results, so it is difficult to decide which algorithm works well. To avoid this uncertainty about which algorithm suits the available data set, and to overcome the limitations of individual algorithms, the cluster ensemble approach has been introduced, which yields an efficient result and also improves the quality of the result. The cluster ensemble method is described in detail as follows: The clustering ensemble has emerged as a prominent method for improving the accuracy of unsupervised learning. It combines multiple