International Journal of Computer Applications (0975 – 8887)
Volume 120 – No.9, June 2015

A Review on Text Similarity Technique used in IR and its Application

Nitesh Pradhan
M.Tech Scholar, Computer Science Department, Maulana Azad National Institute of Technology, BHOPAL (M.P)

Manasi Gyanchandani, PhD
Asst. Professor, Computer Science Department, Maulana Azad National Institute of Technology, BHOPAL (M.P)

Rajesh Wadhvani, PhD
Asst. Professor, Computer Science Department, Maulana Azad National Institute of Technology, BHOPAL (M.P)

ABSTRACT
With the large number of documents on the web, there is an increasing need to retrieve the most relevant document. Different techniques can be used to retrieve the most relevant document from a large corpus. Similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. Text similarity means that the user's query text is matched against the document text, and on the basis of this matching the user retrieves the most relevant documents. Text similarity also plays an important role in the categorization of text as well as documents. We can measure the similarity between sentences, words, paragraphs and documents to categorize them efficiently. On the basis of this categorization, we can retrieve the most relevant document for the user's query. This paper describes different types of similarity, such as lexical similarity and semantic similarity.

General Terms
Text Similarity, Text Mining, Text Summarization

Keywords
Text similarity, Lexical similarity, Semantic similarity, Corpus based similarity and Knowledge based similarity.

1. INTRODUCTION
Information Retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources.
Information Retrieval has many applications, of which blog search is one of the most important. Searching can be done on the basis of similarity. Similarity is a process through which we determine the relationship between text snippets. Text similarity is of two kinds: lexical similarity and semantic similarity. Lexical similarity measures similarity on the basis of character or statement matching; for example, “Put” and “Cut” are lexically similar to each other. Semantic similarity measures similarity on the basis of meaning; for example, “Support Vector Machine” and “SVM” are semantically similar to each other. Text similarity is used in several areas, including information retrieval, clustering, text categorization, topic detection, question answering, machine translation and text summarization.

The remaining sections are organized as follows. The second section describes the lexical similarity measures. The semantic similarity techniques are described in the third section. Fusion similarity and the conclusion are described in the fourth and fifth sections respectively.

2. LEXICAL SIMILARITY
Lexical similarity [19] measures similarity on the basis of character and statement matching. It is a measure of the degree to which the word sets of two given strings are similar. A lexical similarity of 1 (meaning 100%) indicates a total overlap between the words, whereas a lexical similarity of 0 means that the given strings have no words in common. This survey presents the most popular lexical similarity measures, which are implemented in the SimMetrics [1] package. Lexical similarity is categorized into character based similarity and statement based similarity. Four different algorithms are described under character based similarity and five under statement based similarity, as shown in Figure 1.
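
To make the 0-to-1 scale concrete, the following minimal sketch computes a word-set overlap between two strings. It assumes a Jaccard-style ratio (shared words divided by all distinct words) and whitespace tokenization; both are illustrative choices rather than the definition used by SimMetrics, and the function name `word_set_overlap` is hypothetical.

```python
def word_set_overlap(a: str, b: str) -> float:
    """Jaccard-style overlap between the lowercase word sets of two strings.

    Assumption: words are separated by whitespace and compared
    case-insensitively; this is an illustrative measure only.
    """
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    if not words_a and not words_b:
        # Assumption: two empty strings are treated as fully similar.
        return 1.0
    shared = words_a & words_b      # words present in both strings
    combined = words_a | words_b    # all distinct words in either string
    return len(shared) / len(combined)


if __name__ == "__main__":
    print(word_set_overlap("the cat sat", "sat the cat"))  # identical word sets
    print(word_set_overlap("red apple", "blue sky"))       # no common words
```

Under these assumptions, two strings with identical word sets score 1 and two strings with no common word score 0, matching the two extremes described above.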
2.1 Character based similarity

2.1.1 Longest common subsequence (LCS) Similarity
LCS [2] matching is a commonly used technique to measure the similarity between two strings (i, j). LCS measures the length of the longest sequence of matched characters that appear in the same order in both strings, although not necessarily contiguously. The LCS similarity of two given strings (i, j) is given by

[equation missing from the source text]

The longest common subsequence is computed with a dynamic programming approach, which takes O(mn) time for strings of lengths m and n. LCS can be represented as a distance matrix and can be used for indexing in databases, but its drawback is space complexity: a recursive implementation uses the call stack, which takes a lot of space
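
The dynamic programming approach mentioned above can be sketched as follows. This is a minimal illustration, not the implementation from [2] or SimMetrics: it returns only the LCS length, and it keeps just two rows of the table at a time as one possible way to reduce the space cost. The function name `lcs_length` is hypothetical.

```python
def lcs_length(s: str, t: str) -> int:
    """Length of the longest common subsequence of s and t.

    Iterative dynamic programming: O(len(s) * len(t)) time.
    Only the previous and current rows are stored, so the extra
    space is O(min(len(s), len(t))) and no recursion stack is used.
    """
    if len(t) > len(s):
        s, t = t, s  # keep the shorter string along the row
    prev = [0] * (len(t) + 1)  # row for the empty prefix of s
    for ch_s in s:
        curr = [0]  # column 0: empty prefix of t
        for k, ch_t in enumerate(t, start=1):
            if ch_s == ch_t:
                # Extend the subsequence ending before both characters.
                curr.append(prev[k - 1] + 1)
            else:
                # Drop one character from either string, keep the best.
                curr.append(max(prev[k], curr[k - 1]))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(lcs_length("put", "cut"))  # common subsequence "ut"
```

The two-row layout is enough when only the length is needed; recovering the subsequence itself would require the full table or a more elaborate scheme.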