International Journal of Engineering & Technology, Management and Applied Sciences
www.ijetmas.com Vol 1, Issue 1, June 2014

Vector Space Model for Telugu Information Retrieval

Ramakrishna Kolikipogu
Department of IT, SWEC (JNTUH Hyderabad)
krkrishna.csit@gmail.com

ABSTRACT
The explosion of web usage and the development of the internet in Asia over the last few decades have helped most Indian languages grow on the web, and the availability of digital text documents in these languages has increased considerably. As content in these languages grows, efficient methods are required to access the information. Information Retrieval in Indian Languages (IRIL) is growing rapidly, and researchers' efforts are notable even though IR in Indian languages is still at a nascent stage. Indian languages are reported to be rich in morphology. Telugu is a South Indian language in which the need for word conflation is very high. A normalization procedure is always required for query translation, in which a word is reduced to its bare stem. The same stem, however, is likely to appear in the documents in many inflected forms, and finding all of these matches and converting them to the form used in the query, so that the search gives effective results, is a tedious task. This paper gives a step-by-step account of how an information retrieval system in Telugu processes a user query. The experiments were conducted on a standard Telugu news corpus consisting of 1528 items, with 50 queries. System performance was measured by calculating precision at various cut-off levels of the top-ranked documents. Around 50% of the documents returned for the user queries are relevant when the total number of relevant documents in the collection is unknown.
When the list of relevant documents in the corpus is known, the average accuracy of the Telugu Information Retrieval system is found to be 61% in terms of interpolated precision.

Keywords
Information Retrieval, Telugu, Encoding Standards, Romanization, Tokenization, Indian Languages.

INTRODUCTION
An information retrieval system (IRS) aims to retrieve the items that are relevant and of interest for a given user query. The traditional IR approach is similar to what a web search engine such as Google does, although the original IR engines predate the web and searched locally stored collections of documents instead. Item retrieval matches a user-supplied topic or query against the useful parts of text records and presents the retrieved results to the end user. Such records can be of any type; in this work the documents consist mainly of unstructured text, such as newspaper articles, literature pages, and Wikipedia review articles. User queries can range from a single word to a multi-sentence description of an information need. In general, a user query consists of a "bag of words" that represents the search interest. The bag of words needs to be indexed before it is matched against the indexed documents. The basic retrieval process using TF-IDF based similarity has been adopted to test the performance of the Telugu IR system. Figure 1 shows an abstract view of a generic information retrieval system.
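
The TF-IDF based matching of a bag-of-words query against indexed documents can be illustrated with a minimal sketch. It is not the system evaluated in this paper. The weighting variant (raw term frequency multiplied by log(N/df)), the use of cosine similarity, the whitespace tokenizer, all function names and the toy English documents are assumptions made for illustration. A Telugu system would also need the stemming and normalization step described in the abstract before indexing.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for real
    # Telugu tokenization and stemming.
    return text.lower().split()


def tfidf_vector(tokens, idf):
    # Weight = raw term frequency * IDF; terms unseen in the
    # collection have no IDF entry and are skipped.
    tf = Counter(tokens)
    return {term: freq * idf[term] for term, freq in tf.items() if term in idf}


def build_index(docs):
    # Returns (list of per-document TF-IDF vectors, IDF table).
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    return [tfidf_vector(tokens, idf) for tokens in tokenized], idf


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    # Returns (score, document index) pairs, best match first;
    # ties are broken by document index.
    vectors, idf = build_index(docs)
    q = tfidf_vector(tokenize(query), idf)
    scores = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


if __name__ == "__main__":
    # Hypothetical toy collection, not taken from the Telugu news corpus.
    docs = [
        "rain floods city roads",
        "city council election results",
        "cricket team wins final match",
    ]
    for score, idx in rank("city election", docs):
        print(f"doc {idx}: {score:.3f}")
```

In this toy collection the second document shares both query terms and ranks first, and the third shares none and scores zero. The rarer term "election" contributes more to the score than "city", which occurs in two of the three documents.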