Journal of Theoretical and Applied Information Technology
10th February 2014. Vol. 60 No. 1
© 2005 - 2014 JATIT & LLS. All rights reserved.
ISSN: 1992-8645 www.jatit.org E-ISSN: 1817-3195

A NOVEL APPROACH FOR TEXT CLUSTERING USING MUST LINK AND CANNOT LINK ALGORITHM

1 J.DAFNI ROSE, 2 DIVYA D. DEV, 3 C.R.RENE ROBIN

1 St.Joseph’s Institute Of Technology, Department Of Computer Science And Engineering, Chennai-119
2 St.Joseph’s College Of Engineering, Department Of Computer Science And Engineering, Chennai-119
3 Jerusalem College Of Engineering, Department Of Computer Science And Engineering, Chennai-100

E-mail: 1 jdafnirose@yahoo.co.in, 2 divyaddev@gmail.com, 3 crrenerobin@gmail.com

ABSTRACT

Text clustering groups documents with high levels of similarity and has found applications in different areas of text mining and information retrieval. The volume of digital data available today has grown enormously, and retrieving useful information from it is a major challenge. Text clustering is an important means of organizing this data and extracting useful information from the available corpus. In this paper, we propose a novel method for clustering text documents. In the first phase, features are selected using a genetic-based method. In the next phase, the extracted keywords are clustered using a hybrid algorithm, and the clusters are classed under meaningful topics. The MLCL algorithm works in three phases. First, the linked keywords from the genetic-based extraction method are identified with a Must Link and Cannot Link (MLCL) algorithm. Second, the MLCL algorithm forms the initial clusters. Finally, the clusters are optimized using Gaussian parameters. The proposed method is tested on datasets such as Reuters-21578 and the Brown Corpus. The experimental results show that the proposed method performs better than the fuzzy self-constructing feature clustering algorithm.
Keywords: Genetic Algorithm, Keyword Extraction, Text Clustering, MLCL Algorithm.

1. INTRODUCTION

Text mining is an important process in the field of information retrieval [1]. It comprises a wide range of processes such as text clustering, classification, text summarization and automatic organization of text documents. The number of documents available on the internet is increasing day by day, and most of them are loosely structured. Clustering has become a significant and widely used text mining tool for structuring these documents, so that similar documents are clustered into the same group and dissimilar documents are separated into different groups [17].

Text clustering is an unsupervised learning method in which similar documents are grouped into clusters. It is defined as a method of finding groups of similar objects in the data, where the similarity between objects is calculated using one of various similarity functions. Clustering is useful in many text domains, where the objects to be clustered may be paragraphs, sentences, documents or terms. It helps to organize documents, which in turn improves information retrieval and supports browsing [4].

The quality of any text mining method, such as classification or clustering, depends heavily on the noisiness of the features used in the process. The features should therefore be selected effectively to improve clustering quality. Commonly used feature selection methods include document frequency based selection, term strength and entropy based ranking. Once the features have been selected, any text mining task such as classification, clustering or summarization can be applied [4].

The basic characteristics of text documents include high dimensionality, sparsity and noisy features. The performance of clustering algorithms is influenced by these properties.
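
As an illustration of the ideas above, the following minimal Python sketch applies document frequency based feature selection and then compares documents with a similarity function. It is not the method proposed in this paper. The whitespace tokenizer, the threshold `min_df`, the choice of cosine similarity, all function names and the four toy documents are assumptions made for this example only.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase a document and split it on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def select_by_document_frequency(docs, min_df=2):
    """Keep terms that occur in at least `min_df` documents (assumed threshold)."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))  # count each term once per document
    return sorted(term for term, count in df.items() if count >= min_df)


def term_vector(doc, vocabulary):
    """Represent a document as raw term counts over the selected vocabulary."""
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus: two documents about oil, two about football.
docs = [
    "oil prices rise as crude supply falls",
    "crude oil supply falls again",
    "football team wins the league title",
    "league title decided as team wins",
]
vocabulary = select_by_document_frequency(docs, min_df=2)
vectors = [term_vector(d, vocabulary) for d in docs]

same_topic = cosine_similarity(vectors[0], vectors[1])
other_topic = cosine_similarity(vectors[0], vectors[2])
print(vocabulary)
print(same_topic > other_topic)
```

In this toy corpus the two oil documents are more similar to each other than to the football documents. The function word "as" survives the document frequency filter because it occurs in two documents, a small example of the noisy features mentioned above.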
[17] Text clustering finds numerous applications in customer segmentation, classification, visualization, document organization and indexing. The two main classifications of clustering algorithms are hierarchical based and K-means