Clustering and Classification of Text Documents Using Improved Similarity Measure

G. Suresh Reddy 1, T.V. Rajinikanth 2, A. Ananda Rao 3

1 Department of Information Technology, VNR VJIET, Hyderabad, India
2 Department of Computer Science and Engineering, SNIST, Hyderabad, India
3 Department of Computer Science and Engineering, JNTU University, Anantapur, India

Abstract: Dimensionality reduction is both challenging and important in text mining. We need to know which features to retain and which to discard, since this reduces the processing overhead of text classification and text clustering. Another concern in text clustering and text classification is the choice of similarity measure used to find the degree of similarity between any two text documents. In this paper, we address text clustering and text classification by performing dimensionality reduction using SVD and then applying the proposed similarity measure, which is an improved version of our previous measure [25, 31]. The proposed measure is used for both supervised and unsupervised learning. It overcomes the disadvantages of the existing measures [10].

Keywords: Feature Vector, Similarity, Feature Set, Commonality

1. Introduction

Text mining may be defined as the field of research that aims at discovering and retrieving hidden and useful knowledge through automated analysis of freely available text information. It is one of the research fields evolving rapidly from its parent field, information retrieval [1]. Text mining involves various approaches, such as extracting text information, identifying and summarizing text, text categorization, and clustering. Text information may be available in either structured or unstructured form. One of the widely studied data mining techniques in the text domain is text clustering.
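The pipeline outlined in the abstract, SVD-based dimensionality reduction followed by a similarity computation between documents, can be illustrated with a minimal sketch. Several things here are assumptions made for illustration and are not taken from the paper: the toy corpus, the whitespace tokenizer, the idf form log(N/df), the use of power iteration with deflation to approximate the leading singular vectors (a library SVD routine would normally be used instead), and plain cosine similarity as a stand-in for the proposed measure, which is not defined in this part of the text. All function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Build a documents-by-terms tf-idf matrix from whitespace-tokenized text."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    matrix = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        matrix.append([term_freq[t] * math.log(n_docs / doc_freq[t]) for t in vocab])
    return vocab, matrix


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def top_right_singular_vectors(matrix, k, iterations=200):
    """Approximate the top-k right singular vectors of `matrix`.

    Runs power iteration with deflation on the Gram matrix A^T A, whose
    eigenvectors are the right singular vectors of A.
    Assumption: k does not exceed the rank of the matrix.
    """
    n_terms = len(matrix[0])
    gram = [[sum(row[i] * row[j] for row in matrix) for j in range(n_terms)]
            for i in range(n_terms)]
    found = []
    for _ in range(k):
        v = [1.0 + i / n_terms for i in range(n_terms)]  # deterministic start
        for _step in range(iterations):
            w = [dot(row, v) for row in gram]
            for u in found:  # remove components along earlier vectors
                proj = dot(w, u)
                w = [a - proj * b for a, b in zip(w, u)]
            norm = math.sqrt(dot(w, w))
            if norm == 0.0:
                break
            v = [a / norm for a in w]
        found.append(v)
    return found


def reduce_documents(matrix, k):
    """Project each document onto the top-k right singular vectors (A V_k)."""
    basis = top_right_singular_vectors(matrix, k)
    return [[dot(row, v) for v in basis] for row in matrix]


def cosine_similarity(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_u, norm_v = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)


# Toy corpus (hypothetical): two documents per topic, no shared terms across topics.
docs = [
    "cluster text documents cluster",
    "text documents cluster analysis",
    "stock market prices",
    "market prices fall",
]
vocab, weights = tfidf_matrix(docs)
reduced = reduce_documents(weights, k=2)
same_topic = cosine_similarity(reduced[0], reduced[1])
cross_topic = cosine_similarity(reduced[0], reduced[2])
print(round(same_topic, 3), round(cross_topic, 3))
```

Because the two topics in this toy corpus share no terms, the same-topic pair should score close to 1 and the cross-topic pair close to 0 in the reduced two-dimensional space. Any other similarity function with the same two-vector signature could be swapped in for `cosine_similarity`.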
Text clustering may be viewed as an unsupervised learning approach that groups text files of a similar nature into one category, thereby separating dissimilar content into other groups. In contrast, text classification is a supervised learning technique in which the class labels are known in advance. In this paper, we limit our work to text clustering and classification.

Clustering is an NP-hard problem. One common challenge for clustering is the curse of dimensionality, which makes clustering a complex task. The second challenge for text clustering and classification approaches is the sparseness of the word distribution. Sparse features make classification and clustering inaccurate and inefficient, and make the result difficult to judge. The third challenge is deciding the feature size of the dataset, because relevant features may be eliminated in the process of noise elimination. Deciding on the number of clusters is also complex and debatable.

In this paper, we carry out dimensionality reduction in two stages. The first stage takes into consideration the elimination of stop words and the stemming of words, followed by the computation of tf-idf. The second stage uses the singular value decomposition approach. This is followed by the use of the proposed similarity measure, which improves on the similarity measure in [25]. The proposed measure is applied to the supervised learning process and also for

Special issue on “Computing Applications and Data Mining” International Journal of Computer Science and Information Security (IJCSIS), Vol. 14 S1, February 2016 39 https://sites.google.com/site/ijcsis/ ISSN 1947-5500