An Enhanced Approach of Preprocessing the Document using WordNet in Text Clustering

Tulika Narang
University of Allahabad
Allahabad, India
n.tulika@gmail.com

Shashi Prakash Tripathi
University of Allahabad
Allahabad, India
shashi.123.prakash@gmail.com

Abstract: This paper improves on existing clustering approaches by preprocessing the document corpus before grouping it into clusters. Using the Natural Language Toolkit, words in the corpus are replaced with their definitions to reduce the ambiguity that arises when a word is considered out of its context. This reduces the feature space as well as the dimension of the corpus. We use K-means and hierarchical clustering to cluster the document corpus and to visualize the results.

Keywords: Clustering, WSD, K-means, Hierarchical, R language

I. INTRODUCTION

Text clustering helps organize a large corpus into smaller, manageable, and relevant groups, which is important for information retrieval, comprehension, and browsing of the corpus. Earlier clustering approaches depend mainly on the bag-of-words (BOW) representation, so they could not capture semantic relations, and the sense of the corpus was not accurately represented. With the rapid growth of textual data, there has been massive growth in the vocabulary, dimension, and semantic information of the data. Accordingly, we need an algorithm that can represent the sense of a document, enhance clustering performance, and process small samples of data. Many researchers are currently working on semantic-based approaches. WordNet [1], a combination of dictionary and thesaurus, is the most widely used resource for improving the quality of text clustering with the help of semantic parameters.
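The limitation of the bag-of-words representation can be pictured with a minimal sketch. It assumes a tiny hand-written concept map (`TOY_CONCEPTS`, a hypothetical stand-in for a real WordNet lookup) and two toy documents that are not taken from the paper. Under plain BOW the two documents share no term, while mapping synonyms to a shared concept label makes them identical and halves the vocabulary.

```python
import math
from collections import Counter

# Hypothetical toy lexicon standing in for WordNet: each surface word maps
# to a shared concept label. This is an assumption for illustration only.
TOY_CONCEPTS = {
    "car": "motor_vehicle",
    "automobile": "motor_vehicle",
    "purchase": "buy",
    "bought": "buy",
}

def bag_of_words(text, concepts=None):
    """Count tokens; optionally replace each token with its concept label."""
    tokens = text.lower().split()
    if concepts is not None:
        tokens = [concepts.get(tok, tok) for tok in tokens]
    return Counter(tokens)

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc1 = "bought car"
doc2 = "purchase automobile"

# Plain BOW: the documents have no term in common.
plain_sim = cosine(bag_of_words(doc1), bag_of_words(doc2))

# Concept-mapped BOW: synonyms collapse onto the same features.
concept_sim = cosine(bag_of_words(doc1, TOY_CONCEPTS),
                     bag_of_words(doc2, TOY_CONCEPTS))

vocab_plain = set(bag_of_words(doc1)) | set(bag_of_words(doc2))
vocab_concept = (set(bag_of_words(doc1, TOY_CONCEPTS))
                 | set(bag_of_words(doc2, TOY_CONCEPTS)))

print(plain_sim, concept_sim, len(vocab_plain), len(vocab_concept))
```

In this toy case the plain similarity is zero and the vocabulary has four terms, whereas the concept-mapped version has two terms and a similarity of essentially one. A real pipeline would need a sense-selection step, since a word can belong to several WordNet synsets.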
However, some challenges remain:

• Synonym and polysemy problems
• High-dimensional term features
• Extracting core semantics from text
• Assigning distinguishing and meaningful descriptions to the generated clusters

Researchers have used ontology concepts in place of the original words to address synonymy and polysemy; this approach is known as word sense disambiguation (WSD) [2]. It does not improve clustering performance, however, and instead enlarges the feature space; clustering performance can even be reduced. We therefore have to treat the dimension of the document as a higher priority. The available techniques for this problem are based on matrix operations, such as ICA, LDA, and LSI. These models require heavy computation, which is also a drawback. Many other models consider semantic relations, but they have their own drawbacks, for example not considering the sense of the corpus. Sometimes, when the dimension of the document is reduced, its feature space shrinks as well, which in turn reduces the semantic content of the data. It is desirable to extract disambiguated words, with their core semantic features, that are “cluster-aware”, so that clustering quality improves with a smaller number of terms.

This paper compares the text clustering results obtained when clustering is performed on the preprocessed documents with those obtained when it is applied directly to the documents. The different phases of clustering can be explained using the figure given below.
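As a rough illustration of the comparison described in this section, the following sketch clusters four toy documents twice: once on raw terms and once after a preprocessing step that maps words to concept labels. Everything in it is an assumption made for illustration: the documents, the hypothetical `TOY_CONCEPTS` map (a stand-in for a WordNet lookup), the cosine distance, and the choice of single-linkage agglomerative clustering cut at `k` clusters. It is not the configuration used in the paper.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical concept map standing in for a WordNet lookup (an assumption
# for this illustration, not the lexicon used in the paper).
TOY_CONCEPTS = {
    "car": "motor_vehicle", "automobile": "motor_vehicle",
    "repair": "maintenance", "service": "maintenance",
    "physician": "doctor",
    "visit": "appointment",
}

def vectorize(text, concepts=None):
    """Term-count vector; optionally map each token to its concept label."""
    tokens = text.lower().split()
    if concepts is not None:
        tokens = [concepts.get(tok, tok) for tok in tokens]
    return Counter(tokens)

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - (dot / norm if norm else 0.0)

def single_linkage(vectors, k):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Single linkage: distance between the closest pair of members.
            d = min(cosine_distance(vectors[p], vectors[q])
                    for p in clusters[i] for q in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)

docs = [
    "car repair cost",       # 0: vehicle topic
    "automobile service",    # 1: vehicle topic
    "physician visit cost",  # 2: medical topic
    "doctor appointment",    # 3: medical topic
]

clustered_raw = single_linkage([vectorize(d) for d in docs], k=2)
clustered_concepts = single_linkage(
    [vectorize(d, TOY_CONCEPTS) for d in docs], k=2)

print("raw terms:     ", clustered_raw)
print("concept-mapped:", clustered_concepts)
```

On raw terms, the only overlap is the incidental word "cost", so documents 0 and 2 are merged first even though they belong to different topics. After concept mapping, the two vehicle documents and the two medical documents group together. The example is deliberately small and says nothing about how the two settings compare on a real corpus.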