International Journal of Computer Applications (0975-8887), Volume 41, No. 7, March 2012

Performance Comparison of Hard and Soft Approaches for Document Clustering

Vibekananda Dutta
Central University of Rajasthan
Kishangarh, India

Krishna Kumar Sharma
Central University of Rajasthan
Kishangarh, India

Deepti Gahalot
Govt. Engineering College
Ajmer, India

ABSTRACT
The amount of information available through large shared sources such as search engines has grown tremendously. Fast, high-quality document clustering algorithms play an important role in helping users work effectively with vertical search engines and the World Wide Web, and in summarizing and organizing information. Recent surveys have shown that partitional clustering algorithms are well suited to clustering large datasets such as the World Wide Web. The K-means algorithm is the most commonly used partitional clustering algorithm because it is easy to implement and efficient in terms of execution time. In this paper we present a short overview of a soft approach, an optimal fuzzy document clustering algorithm, and compare it with the hard approach. In the experiments we conducted, we applied a hard approach (K-means) and a soft approach (Fuzzy c-means) to different text document datasets. The number of documents in the datasets ranges from 1,500 to 2,600 and the number of terms ranges from 6,000 to over 7,500 for both approaches. The results illustrate that the soft approach generates slightly better results than the hard approach.

General Terms
Soft approaches, hard approaches, partitional clustering, vertical search engine, TF-IDF (term frequency-inverse document frequency)

Keywords
Document Clustering, Hard and Soft Approaches, Text Datasets, Cluster Centroid, Vector Space Model.

1. INTRODUCTION
Document clustering is a fundamental operation used in unsupervised document organization, automatic topic extraction, and information retrieval.
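Under the vector space model named in the keywords, each document is represented as a vector of TF-IDF term weights before clustering. The following is a minimal sketch of that representation; the toy corpus and the helper name are our own illustration, not code from the paper:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF vectors (as sparse term->weight dicts) for a list of
    tokenized documents. Illustrative sketch only; production systems
    typically use a library implementation."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # TF-IDF weight: (term frequency in doc) * log(N / document frequency).
        vec = {t: (cnt / len(doc)) * math.log(n / df[t]) for t, cnt in tf.items()}
        vectors.append(vec)
    return vectors

docs = [["web", "search", "engine"],
        ["web", "document", "clustering"],
        ["fuzzy", "clustering", "algorithm"]]
vecs = tfidf_vectors(docs)
```

Note that a term occurring in every document receives weight zero under this scheme, which is the intended behavior: such terms carry no discriminative information for clustering.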
Clustering involves dividing a set of objects into a specified number of clusters [14]. The motivation behind clustering a set of data is to find inherent structure in the data and expose this structure as a set of groups. The data objects within each group should exhibit a large degree of similarity, while the similarity among different clusters should be minimized [5, 9, 13]. There are two major clustering techniques: "partitioning" and "hierarchical" [9]. Most document clustering algorithms can be classified into these two groups. Hierarchical techniques produce a nested sequence of partitions, with a single, all-inclusive cluster at the top and singleton clusters of individual points at the bottom. The partitioning clustering method seeks to partition a collection of documents into a set of non-overlapping groups, so as to maximize the evaluation value of the clustering. Although the hierarchical clustering technique is often portrayed as the better-quality clustering approach, it does not contain any provision for the reallocation of entities that may have been poorly classified in the early stages of the text analysis [9]. Moreover, the time complexity of this approach is quadratic [13]. In recent years, it has been recognized that the partitional clustering technique is well suited for clustering a large document dataset due to its relatively low computational requirements [13]. The time complexity of the partitioning technique is almost linear, which makes it widely used. The best-known partitioning clustering algorithm is the K-means algorithm and its variants [10]. This algorithm is simple and straightforward, and is based on the firm foundation of analysis of variance. The K-means algorithm clusters a group of data vectors into a predefined number of clusters. It starts with random initial cluster centers and keeps reassigning the data objects in the dataset to cluster centers based on the similarity between the data object and the cluster centers.
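The reassignment loop just described can be sketched as follows over sparse TF-IDF vectors, using cosine similarity as the document similarity measure. This is our own minimal illustration under those assumptions, not the paper's implementation; the helper names are hypothetical:

```python
import random

def cosine(a, b):
    """Cosine similarity between two sparse term->weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = sum(w * w for w in a.values()) ** 0.5
    nb = sum(w * w for w in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def kmeans(vectors, k, max_iter=100, seed=0):
    """K-means sketch: random initial centers, then alternate between
    reassigning documents to the most similar centroid and recomputing
    centroids, until assignments stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)  # random initial cluster centers
    assign = None
    for _ in range(max_iter):
        new_assign = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                      for v in vectors]
        if new_assign == assign:
            break  # convergence criterion: no document was reassigned
        assign = new_assign
        # Recompute each centroid as the mean of its member vectors.
        for c in range(k):
            members = [v for a, v in zip(assign, vectors) if a == c]
            if members:
                terms = {t for v in members for t in v}
                centroids[c] = {t: sum(v.get(t, 0.0) for v in members)
                                / len(members) for t in terms}
    return assign

vecs = [{"a": 1.0}, {"a": 0.9, "b": 0.1}, {"b": 1.0}, {"b": 0.9, "a": 0.1}]
labels = kmeans(vecs, 2)
```

As the next paragraph notes, the outcome depends on the random choice of initial centers: rerunning with a different `seed` can converge to a different local optimum.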
The reassignment procedure does not stop until a convergence criterion is met (e.g., a fixed number of iterations is reached, or the cluster result does not change over a certain number of iterations). The main drawback of the hard approach (K-means) is that the clustering result is sensitive to the selection of the initial cluster centroids and may converge to a local optimum [12]. Therefore, the initial selection of the cluster centroids determines the main processing of K-means and the resulting partition of the dataset as well. The main processing of K-means is to search for the locally optimal solution in the vicinity of the initial solution and to refine the partition result. The