Proc. of Int. Conf. on Advances in Computer Engineering 2011
© 2011 ACEEE
DOI: 02.ACE.2011.02.114
Short Paper
Cross Lingual Information Retrieval Using
Search Engine and Data Mining
Mallamma V Reddy (1), Dr. M. Hanumanthappa (2), Manish Kumar (3)
(1,2) Department of Computer Science and Applications,
Bangalore University, Bangalore, INDIA
(1) mallamma_vreddy@yahoo.co.in
(2) hanu6572@hotmail.com
(3) Department of Master of Computer Applications,
M. S. Ramaiah Institute of Technology, Bangalore-560 054, INDIA
(3) manishkumarjsr@yahoo.com
Abstract: With the explosive growth in the number of international
users, in distributed information and in the linguistic resources
accessible through the World Wide Web, information retrieval has
become crucial for users who need to find, retrieve and understand
relevant information in any language and form. Cross-Language
Information Retrieval (CLIR) is a subfield of Information Retrieval
in which a query is posed in one language and document collections
in one or many languages are searched; in a narrower sense, the term
also refers to retrieval from a document collection that is itself
multilingual. In the present research, we focus on query
translation, disambiguation of multiple translation candidates
and query expansion in various combinations, in order to
improve the effectiveness of retrieval. Terms that emphasize query
concepts are extracted, selected and added using expansion
techniques such as pseudo-relevance feedback, domain-based
feedback and thesaurus-based expansion. This paper presents a
method for retrieving information for a query expressed in a
native language. It uses insights from data mining and intelligent
search to formulate the query and parse the results.
Keywords: Cross Lingual Information Retrieval, Heuristic
Method, Text Categorization
I. INTRODUCTION
Cross-Language Information Retrieval (CLIR) addresses the case
where the user request and the document collection against which
the request is to be matched are in two different human
languages. The aim of CLIR is to match the request against
the collection as if the request had been issued in the
language of the document collection to begin with. Such a
system is useful when a user who can read several languages
wants to find information in a collection containing documents
in many languages, while avoiding the work involved in
formulating multiple requests.
Cross-language information retrieval [1] enables users to enter
queries in languages they are fluent in, and uses language
translation methods to retrieve documents originally written
in other languages. The scope of Cross-Language
Information Access [2] goes beyond the Cross-Lingual
Information Retrieval (CLIR) paradigm by incorporating query
disambiguation as well as post-search processing. The key
emphasis is on the relevance of the results. The cross-lingual
information access paradigm may take the form of machine
translation of snippets, summarization and subsequent
translation of summaries, and/or information extraction from
the target language. CLIR has three basic approaches [3]: a)
document translation, where the queries are posed to existing
document repositories; b) query translation, where the
queries are translated into the target language and the results
are displayed; and c) interlingual translation, where both queries
and results are translated. Our approach is a variation of the
third category.
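To make approach (b) concrete, the following is a minimal sketch of dictionary-based query translation with a simple co-occurrence heuristic for choosing among multiple translation candidates. It is an illustration only, not the system described in this paper: the toy Spanish-English dictionary, the three-sentence target collection, and the function names `cooccurrence` and `translate_query` are all assumptions introduced for this example.

```python
from itertools import product

# Hypothetical toy bilingual dictionary (Spanish -> English); a source
# term may have several translation candidates.
BILINGUAL_DICT = {
    "banco": ["bank", "bench"],
    "dinero": ["money"],
}

# Hypothetical target-language collection, used only for co-occurrence counts.
TARGET_DOCS = [
    "the bank lends money to customers",
    "money deposited in the bank earns interest",
    "a wooden bench in the park",
]


def cooccurrence(term_a, term_b, docs):
    """Count the documents that contain both terms."""
    return sum(1 for d in docs if term_a in d.split() and term_b in d.split())


def translate_query(source_terms, dictionary, docs):
    """Return the candidate combination whose terms co-occur most often.

    Terms missing from the dictionary are passed through untranslated.
    Ties are resolved in favour of the first combination encountered.
    """
    candidate_lists = [dictionary.get(t, [t]) for t in source_terms]
    best, best_score = None, -1
    for combo in product(*candidate_lists):
        # Sum pairwise co-occurrence over all term pairs in this combination.
        score = sum(
            cooccurrence(a, b, docs)
            for i, a in enumerate(combo)
            for b in combo[i + 1:]
        )
        if score > best_score:
            best, best_score = list(combo), score
    return best


print(translate_query(["banco", "dinero"], BILINGUAL_DICT, TARGET_DOCS))
```

In this toy setting, the ambiguous term "banco" resolves to "bank" because "bank" co-occurs with "money" in the target collection while "bench" does not. Enumerating every combination is exponential in query length, so a practical system would need a cheaper selection strategy.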
The objectives of our research work are to:
a) develop an approach for cross-lingual information retrieval
for queries expressed in the native language;
b) use data mining techniques to cluster the results and
retrieve a resultant set closest to the user’s query; and
c) present the results to the user in various display methods.
The proposed approach consists of two distinct stages:
pre-processing the query to identify the query’s meaning, and
post-processing the results for relevance match. In the
pre-processing stage, the query is expanded and the expanded
query is presented to the search engine. In the post-processing
stage, the retrieved documents are reordered according to the
relevance match of their content and presented to the user.
Initial feedback from users seems to indicate that the
relevance of the retrieved documents is higher than with the
conventional approach. However, the time needed to perform
the processing is a significant factor.
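The two stages can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the thesaurus contents, the term-overlap relevance score, and the function names `expand_query`, `relevance` and `rerank` are hypothetical, and the list of retrieved documents stands in for the output of a real search engine.

```python
# Hypothetical thesaurus used for query expansion.
THESAURUS = {
    "car": ["automobile", "vehicle"],
    "price": ["cost"],
}


def expand_query(terms, thesaurus):
    """Pre-processing: append thesaurus entries for each query term."""
    expanded = list(terms)
    for term in terms:
        for related in thesaurus.get(term, []):
            if related not in expanded:
                expanded.append(related)
    return expanded


def relevance(document, query_terms):
    """Fraction of query terms that occur in the document (assumed score)."""
    if not query_terms:
        return 0.0
    words = set(document.lower().split())
    return sum(1 for t in query_terms if t in words) / len(query_terms)


def rerank(documents, query_terms):
    """Post-processing: reorder retrieved documents by relevance score."""
    return sorted(documents, key=lambda d: relevance(d, query_terms), reverse=True)


# Stand-in for documents returned by a search engine for the expanded query.
retrieved = [
    "weather report for today",
    "automobile cost guide",
    "car price and vehicle cost comparison",
]

query = expand_query(["car", "price"], THESAURUS)
for doc in rerank(retrieved, query):
    print(round(relevance(doc, query), 2), doc)
```

Because both stages run in addition to the search itself, the sketch also hints at why processing time becomes a significant factor: expansion enlarges the query, and re-ranking touches every retrieved document.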
Introduction to Information Retrieval
An Information Retrieval (IR) system is defined as any system
that matches a user request against a document collection,
returning a list of documents considered relevant to the
request. The user request is an expression of a user
information need. For example, the user might issue the
following request: "Have you got any documents pertaining
to the Clinton-Lewinsky scandal, particularly regarding his
testimony before Congress?"
An automatic IR system then usually carries out some
processing on the user request to derive a form of the request
that it can match directly against the document collection
using some form of matching algorithm. The processed
request, which may take many forms, is known as the query.
Query formats commonly employed in the IR world include
the natural language query, where the request is not processed
much at all, and the bag of words format, where