J. Dafni Rose, International Journal of Advanced Engineering Technology, E-ISSN 0976-3945
Int J Adv Engg Tech/Vol. VII/Issue I/Jan.-March, 2016/751-753

Research Paper

AN EFFICIENT ASSOCIATION RULE BASED HIERARCHICAL ALGORITHM FOR TEXT CLUSTERING

J. Dafni Rose

Address for Correspondence
Associate Professor & HOD, Department of Computer Science and Engineering, St. Joseph’s Institute of Technology, Chennai, India

ABSTRACT
The amount of information available today has become too large, and whether we are actually obtaining useful information from it remains an open question. Text clustering is one of the techniques that helps organize information so that it can be retrieved more efficiently. This paper presents a new technique for clustering text documents based on association rules. In this approach, the text documents are preprocessed and the associations between the text files are found using the Apriori algorithm. The associated text files are then clustered using a hierarchical clustering algorithm. For comparison, the text files are also clustered using the hierarchical algorithm alone, and the results of both methods are evaluated. The algorithms are tested on the benchmark data set Reuters-21578. The experimental results indicate that the Association Rule Based Hierarchical Clustering method (ARBHC) produces better results and improved cluster quality compared with the hierarchical method.

KEYWORDS - Text clustering, Association rule, Hierarchical algorithm, Apriori.

I. INTRODUCTION
The increase in the number of documents in digital libraries, blogs and mail has created a need for effective and efficient organization of text documents. Text clustering is a technique used to group texts or passages into natural categories that arise from statistical, lexical, and semantic analysis, rather than into the arbitrarily predetermined categories of traditional manual indexing systems.
In the context of text mining, it is the derivation of the categories that is of interest, since this is a form of theme finding.

Text clustering algorithms are generally divided into hierarchical methods, such as agglomerative and divisive algorithms, and partition-based methods, such as k-means. Agglomerative algorithms, such as UPGMA [1], single link [2] and Chameleon [3], find the clusters by initially treating each data point as its own cluster and then repeatedly merging pairs of clusters until a termination criterion is met. Partitional algorithms, such as k-means [4], bisecting k-means [5] and graph-based methods [6], find the clusters by partitioning the data set into a number of small clusters. Partitional algorithms are often sensitive to the initial cluster centroids, and their efficiency also depends on the number of clusters specified. Moreover, they often fail to produce satisfactory clustering results because document data sets are sparse and high-dimensional.

Hierarchical clustering outputs a hierarchy, a structure that is more informative than the unstructured set of clusters returned by a flat clustering algorithm such as k-means. Hierarchical clustering does not require the user to specify the number of clusters in advance, and most hierarchical algorithms that have been used in information retrieval (IR) are deterministic. The price of the hierarchical solution is that the algorithm is inefficient for high-dimensional databases. In general, flat clustering such as k-means is selected when efficiency is important, and hierarchical clustering is selected when one of the potential problems of flat clustering (not enough structure, a predetermined number of clusters, non-determinism) is a concern [7].

A hierarchical algorithm is usually applied to a large number of documents, yet most of these documents are not really related to each other. When the hierarchical algorithm is applied to the whole document collection, its performance is therefore reduced.
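The agglomerative procedure described above (start with one cluster per document, then repeatedly merge the closest pair until a termination criterion is met) can be illustrated with a minimal sketch. It assumes average-link (UPGMA) merging with cosine distance over toy term-count vectors; the function names and the data are hypothetical and are not taken from the paper's implementation.

```python
from itertools import combinations
from math import sqrt


def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def upgma(vectors):
    """Average-link agglomerative clustering.

    Returns the merge history as (members_a, members_b, distance) tuples,
    which is enough to rebuild the hierarchy.
    """
    clusters = [(i,) for i in range(len(vectors))]  # one cluster per document
    merges = []
    while len(clusters) > 1:  # termination criterion: a single root cluster
        best = None
        for a, b in combinations(clusters, 2):
            # UPGMA: mean distance over all cross-cluster document pairs
            d = sum(cosine_distance(vectors[i], vectors[j])
                    for i in a for j in b) / (len(a) * len(b))
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # Replace the two closest clusters with their union
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
        merges.append((a, b, d))
    return merges


# Hypothetical term-count vectors for four tiny documents
docs = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 3],
]
merges = upgma(docs)
for left, right, dist in merges:
    print(left, right, round(dist, 3))
```

Every pass of the loop recomputes all pairwise cluster distances, which keeps the sketch short but also shows why the cost grows quickly with the number of documents.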
In this paper, we therefore propose an improved hierarchical algorithm. In this method, the related text documents are found using the Apriori algorithm and the remaining documents are removed from the database. This reduces the size of the database to a large extent, and the documents that remain in the database are similar to one another. Only the associated documents are then clustered using the hierarchical algorithm, which improves its efficiency. The clustered documents are evaluated using the cophenetic correlation matrix. Experimental results show that using association rules for clustering outperforms the traditional agglomerative hierarchical algorithm.

The rest of this paper is organized as follows. Section II reviews related work on document clustering. Section III gives a detailed description of the proposed approach. Section IV presents experimental results that evaluate the proposed approach. The final section concludes the paper.

II. RELATED WORK
In [8], Yehang Zhu proposes a novel hierarchical clustering method that is a hybrid of the partitioning and agglomerative clustering approaches and combines the merits of both. Partitioning clustering is first applied to determine the initial clusters, and a hierarchical clustering algorithm is then applied to build a hierarchical output.

S. S. Bedi [9] presents two new clustering algorithms that cluster documents effectively in high-dimensional space. In that work, the sets of items that occur frequently together in transactions are found using association rule discovery methods. The frequent items are then grouped into hypergraph edges, and the clusters are found using a hypergraph partitioning algorithm.

Alisa Kongthon [10] presents a new algorithm called “concept grouping” that adapts an association rule mining technique to construct a term thesaurus for data preprocessing.
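Both the proposed method and several of the works reviewed here rely on finding items that frequently occur together. As a reminder of that core step, the following is a minimal sketch of Apriori frequent-itemset mining. It assumes each document is reduced to a set of terms and treated as one transaction; the text above does not specify this mapping, so the mapping, the function name, and the toy data are all hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items
        candidates = {a | b for a, b in combinations(current, 2)
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        # Count support of the surviving candidates
        current = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t)
            if support >= min_support:
                current[c] = support
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical transactions: each document reduced to its set of terms
docs = [
    {"oil", "price", "opec"},
    {"oil", "price", "barrel"},
    {"oil", "opec"},
    {"wheat", "grain"},
]
frequent = apriori(docs, min_support=2)
for itemset, support in sorted(frequent.items(),
                               key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

In this toy data the fourth document shares no frequent itemset with the others, so a filter of the kind described in the introduction would drop it before clustering.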
In Kongthon's approach, terms that are similar but written differently are grouped into the same concept, based on their associations, before they are used for subsequent analysis. The concept grouping algorithm uses “tree structured networks” to construct the thesaurus from related terms.

In [11], an improved association rule algorithm is proposed for an intelligent QA system. In that work, an improved text clustering algorithm, along with the improved association rule algorithm, is