International Journal of Computer Applications (0975-8887), Volume 94, No. 14, May 2014, p. 37

Framework for Document Retrieval using Latent Semantic Indexing

Neelam Phadnis, Computer Engineering (M.E), Thadomal Shahani Engg. College, Mumbai, India
Jayant Gadge, Computer Engineering (M.E), Thadomal Shahani Engg. College, Mumbai, India

ABSTRACT
With the rapid development of the Internet, the volume of textual information is growing quickly, so document retrieval, which aims to find and organize relevant information in text collections, is needed. As large-scale inexpensive storage becomes available, the amount of information stored by organizations will increase, and searching it and deriving useful facts will become more cumbersome. How to extract large amounts of information quickly and effectively has become a focus of current research. Traditional IR techniques find relevant documents by matching the words in a user’s query against individual words in the text collection. The problem with such content-based retrieval systems is that some documents relevant to a user’s query are not retrieved, while many unrelated or irrelevant documents are. This paper proposes an information retrieval method based on the LSI approach. The Latent Semantic Indexing (LSI) model is a concept-based retrieval method that builds on the vector space model and singular value decomposition. The goal of this research is to evaluate the applicability of the LSI technique to textual document search and retrieval.

General Terms
Information Retrieval

Keywords
Document Retrieval, Latent Semantic Indexing, Singular value decomposition

1. INTRODUCTION
With the advent of computers, the amount of stored information is growing phenomenally in both quantity and variety. This information explosion has created great demand for efficient and effective means of organizing and indexing data so that useful information can be retrieved whenever required.
Thus some mechanism is needed to give users easy access to the information they are interested in, and it should retrieve the relevant text in a timely and accurate manner. Data retrieval, in the context of an information retrieval system, consists mainly of determining which documents of a collection contain the keywords in the user query. In fact, the user of an IR system is concerned more with retrieving information about a subject than with retrieving data that satisfies a given query. Users want retrieval on the basis of conceptual context. A given concept can be expressed in a number of ways (synonymy), so the literal terms in a user’s query may not match those of relevant documents. Also, most words have multiple meanings (polysemy), so terms in a user’s query may match words in documents that are of no use to the user.

This paper presents a new approach to document retrieval, designed to overcome the fundamental problems of existing term-matching retrieval techniques. Statistical techniques are used to estimate the hidden (latent) semantic structure. Latent Semantic Indexing is one such statistical information retrieval technique. It is based on an algebraic model of document retrieval and uses a dimension reduction technique known as Singular Value Decomposition. In these techniques, documents are converted into collections of weighted terms, and the goal is to place documents on the same topic close together and dissimilar documents sufficiently far apart [1]. Because the search is based on the concepts contained in the documents rather than on the documents’ constituent terms, LSI can retrieve documents related to a user’s query even when the query and the documents share no common terms.

2. RELATED WORK
For thousands of years, people have practiced the art of archiving information and then finding it again.
The practice of archiving can be traced back to 3000 BC. Even then, the Sumerians realized that proper organization and access were needed for efficient use of data. The need to store and retrieve information became more important with inventions like paper and printing. With the advent of computers, the amount of data being stored increased dramatically, as retrieving information from it could be done mechanically. Vannevar Bush published an article in 1945 that gave birth to the concept of automatic access to large amounts of stored information. Several ideas emerged in the mid-1950s based on searching for text with the help of a computer. Most notable was the development of the SMART system by Gerard Salton at Harvard University [2].

The simplest form of document retrieval is a linear scan through the documents, but it is not efficient when large document collections must be searched quickly. A common problem in information retrieval systems is predicting which documents are relevant and which are not. This decision usually depends on a ranking algorithm, which attempts to order the documents so that those at the top of the ranked list are more likely to be relevant. A ranking algorithm operates according to basic premises regarding the notion of document relevance. Distinct sets of premises yield distinct information retrieval models. The three classic models in information retrieval are the Boolean, probabilistic and vector space models. An information retrieval model consists of a set of representations