Proc. of Int. Conf. on Advances in Computer Engineering 2011 © 2011 ACEEE DOI: 02.ACE.2011.02. Short Paper 114

Cross Lingual Information Retrieval Using Search Engine and Data Mining

Mallamma V Reddy 1, Dr. M. Hanumanthappa 2, Manish Kumar 3
1,2 Department of Computer Science and Applications, Bangalore University, Bangalore, INDIA
1 mallamma_vreddy@yahoo.co.in 2 hanu6572@hotmail.com
3 Department of Master of Computer Applications, M. S. Ramaiah Institute of Technology, Bangalore-560 054, INDIA
3 manishkumarjsr@yahoo.com

Abstract: With the explosive growth in international users, distributed information and linguistic resources accessible throughout the World Wide Web, information retrieval has become crucial for users to find, retrieve and understand relevant information in any language and form. Cross-Language Information Retrieval (CLIR) is a subfield of Information Retrieval in which a query is posed in one language and document collections in one or many languages are searched; the term is also used in the narrower sense of retrieval from a document collection that is itself multilingual. In the present research, we focus on query translation, disambiguation of multiple translation candidates and query expansion, in various combinations, in order to improve the effectiveness of retrieval. Terms that emphasize query concepts are extracted, selected and added using expansion techniques such as pseudo-relevance feedback, domain-based feedback and thesaurus-based expansion. This paper presents a method for information retrieval for a query expressed in a native language. It uses insights from data mining and intelligent search to formulate the query and parse the results.

Keywords: Cross Lingual Information Retrieval, Heuristic Method, Text Categorization

I.
INTRODUCTION

In Cross-Language Information Retrieval (CLIR), the user request and the document collection against which the request is to be matched are in two different human languages. The aim of CLIR is to match the request against the collection as if the request had been issued in the language of the document collection to begin with. Such a system is useful when a user who can read several languages wants to find information in a collection containing documents in many languages, while avoiding the work involved in formulating multiple requests. Cross-language information retrieval [1] enables users to enter queries in languages they are fluent in, and uses language translation methods to retrieve documents originally written in other languages.

The scope of Cross-Language Information Access [2] goes beyond the Cross-Lingual Information Retrieval (CLIR) paradigm by incorporating query disambiguation as well as post-search processing. The key emphasis is on the relevance of the results. The cross-lingual information access paradigm may take the form of machine translation of snippets, summarization and subsequent translation of summaries, and/or information extraction from the target language.

CLIR has three basic approaches [3]:
a) document translation - where the queries are posed to existing document repositories,
b) query translation - where the queries are translated into the target language and results displayed, and
c) inter lingual translations - where queries and results are translated.
Our approach is a variation of the third category.

The objectives of our research work are to: develop an approach for cross lingual information retrieval for queries expressed in the native language; use data mining techniques to cluster the results and retrieve a resultant set closest to the user’s query; and present the results to the user in various display methods.

The key aspect of the proposed approach is as follows.
It is composed of two distinct aspects: preprocessing the query to identify the query’s meaning, and post-processing the results for relevance match. In the preprocessing stage, the query is expanded and the expanded query is presented to the search engine. In the post-processing stage, the retrieved documents are reordered according to the relevance match of their content and presented to the user. Initial feedback from users seems to indicate that the relevance of the retrieved documents is higher than with the conventional approach. However, the time needed to perform the processing is a significant factor.

Introduction to Information Retrieval

An Information Retrieval System is defined as any system that matches a user request against a document collection, returning a list of documents considered relevant to the request. The user request is an expression of a user information need. For example, the user might issue the following request:

Have you got any documents pertaining to the Clinton Lewinsky scandal, particularly regarding his testimony before Congress?

An automatic IR system then usually carries out some processing on the user request to derive a form of the request that it can match directly against the document collection using some form of matching algorithm. The processed request, which may take many forms, is known as the query. Query formats commonly employed in the IR world include the natural language query, where the request is not processed much at all, and the bag of words format, where