International Journal of Computer Applications (0975 – 8887) Volume 4 – No.5, July 2010

A Frequent Concepts Based Document Clustering Algorithm

Rekha Baghel
Department of Computer Science & Engineering, Dr. B. R. Ambedkar National Institute of Technology, Jalandhar, Punjab, 144011, India.

Dr. Renu Dhir
Department of Computer Science & Engineering, Dr. B. R. Ambedkar National Institute of Technology, Jalandhar, Punjab, 144011, India.

ABSTRACT
This paper presents a novel document clustering technique based on frequent concepts. The proposed algorithm, FCDC (Frequent Concepts based Document Clustering), works with frequent concepts rather than the frequent items used in traditional text mining techniques. Many well-known clustering algorithms treat documents as a bag of words and ignore important relationships between words, such as synonymy. The proposed FCDC algorithm uses the semantic relationships between words to create concepts. It exploits the WordNet ontology to build a low-dimensional feature vector, which allows us to develop an efficient clustering algorithm. It uses a hierarchical approach to cluster text documents that share common concepts. FCDC was found to be more accurate, scalable and effective than existing clustering algorithms such as Bisecting K-means, UPGMA and FIHC.

Keywords
Document clustering, Clustering algorithm, Frequent Concepts based Clustering, WordNet.

1. INTRODUCTION
The steady progress of computer hardware technology in the last few years has led to large supplies of powerful and affordable computers, data collection equipment, and storage media. This technology provides a great boost to the database and information industry and makes a huge number of databases and information repositories available for transaction management, information retrieval, and data analysis.
As a result, the volume of text documents available on the internet, in digital libraries, news sources and company-wide intranets has grown tremendously. With the increase in the number of electronic documents, it is hard to manually organize, analyze and present these documents efficiently. Data mining is the process of extracting implicit, previously unknown and potentially useful information from data. Document clustering is one of the important techniques of data mining: it is the unsupervised classification of documents into groups (clusters) such that the documents in each cluster share some common properties according to a defined similarity measure. Documents in the same cluster are highly similar to one another but dissimilar to documents in other clusters [1].

Let us look closely at the special requirements for a good clustering algorithm:
1. The document model should preserve relationships between words, such as synonymy, since different words can have the same meaning.
2. Associating a meaningful label with each final cluster is essential.
3. The high dimensionality of text documents should be reduced.

The goal of this paper is to present a document clustering algorithm, named FCDC (Frequent Concepts based Document Clustering), designed to meet the above requirements. The special feature of FCDC is that it treats a document as a set of related words instead of a bag of words. Different words that share the same meaning are known as synonyms, and a set of such words is known as a concept. Whether documents share the same frequent concepts is used as the measure of their closeness. The proposed algorithm is therefore able to group documents in the same cluster even if they do not contain common words.
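To illustrate how documents with no words in common can still be close at the concept level, the following is a minimal sketch, not the FCDC implementation. The synonym table `SYNSETS`, the concept identifiers, and the helper names are hypothetical stand-ins for a WordNet lookup, and the frequency filtering of concepts is omitted.

```python
# Hypothetical toy synonym table standing in for WordNet synsets.
# Each word maps to an invented concept identifier (assumption).
SYNSETS = {
    "car": "C_vehicle",
    "automobile": "C_vehicle",
    "buy": "C_buy",
    "purchase": "C_buy",
    "doctor": "C_doctor",
    "physician": "C_doctor",
}


def to_concepts(text):
    """Map each known word to its concept id; unknown words are dropped."""
    return {SYNSETS[w] for w in text.lower().split() if w in SYNSETS}


def shared_words(a, b):
    """Words that two documents have in common (bag-of-words view)."""
    return set(a.lower().split()) & set(b.lower().split())


def shared_concepts(a, b):
    """Concepts that two documents have in common (concept view)."""
    return to_concepts(a) & to_concepts(b)


doc1 = "buy automobile"
doc2 = "purchase car"
doc3 = "visit physician"

print(sorted(shared_words(doc1, doc2)))
print(sorted(shared_concepts(doc1, doc2)))
print(sorted(shared_concepts(doc1, doc3)))
```

Here `doc1` and `doc2` share no words but share both of their concepts, while `doc1` and `doc3` share neither words nor concepts. Dropping words that are missing from the table is a simplification made for this sketch only.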
In FCDC, we construct the feature vector based on concepts and apply the Apriori paradigm [2] to discover frequent concepts; the frequent concepts are then used to create clusters. We found that FCDC is more efficient and accurate than other clustering algorithms.

The rest of the paper is organized as follows: Section 2 reviews related work. Section 3 describes our algorithm in more detail. Section 4 discusses some experimental results. We conclude the paper in Section 5.

2. RELATED WORK
Many clustering techniques have been proposed in the literature. Clustering algorithms are mainly categorized into hierarchical and partitioning methods [2, 3, 4, 5]. A hierarchical clustering method works by grouping data objects into a tree of clusters [6]. These methods can be further classified into agglomerative and divisive hierarchical clustering, depending on whether the hierarchical decomposition is formed in a bottom-up or top-down fashion. K-means and its variants [7, 8, 9] are the most well-known partitioning methods [10]. Lexical chains, constructed from the occurrences of terms in a document, have been proposed in [11]. The problem of improving clustering quality when cluster sizes vary widely is addressed in [10]. The authors state that variation in cluster size reduces the clustering accuracy for