International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 4, August 2014
DOI:10.5121/ijcsit.2014.6413

LANGUAGE INDEPENDENT DOCUMENT RETRIEVAL USING UNICODE STANDARD

Vidhya M 1 and Aji S 2
Department of Computer Science, University of Kerala, Thiruvananthapuram, Kerala, India

ABSTRACT

In this paper, we present a method to retrieve documents containing unstructured text data written in different languages. Unlike ordinary document retrieval systems, the proposed system can also process queries with terms in more than one language. Unicode, the universally accepted encoding standard, is used to represent the data on a common platform while converting the text data into a Vector Space Model. We obtained notable F-measure values in the experiments irrespective of the languages used in the documents and queries.

KEYWORDS

Language independent searching, Information Retrieval, Multilingual searching, Unicode, QR Factorization, Vector Space Model.

1. INTRODUCTION

The digital world is becoming a universal storehouse of knowledge and culture, which has allowed an unswerving sharing of ideas and information at an unpredictable rate. The efficiency of a depository is always measured in terms of the accessibility of its information, which depends on the searching process that essentially acts as a filter for the richness of information available in a data container. The source of information or data in the depository can be any type of digital data that has been produced in, or transformed into, digital format, such as electronic books, articles, music, movies, and images. Language is also no longer a barrier to expressing ideas and thoughts in digital form; as a result, the diversity of data is increasing along with its volume. Research and findings in Information Retrieval methods that support multiple languages have a vital role in the coming years of the information era.
Even though there are many findings on information retrieval mechanisms [1, 11], there is not much work on language independent information retrieval methods. The output of the CLEF (Cross-Language Evaluation Forum) workshop [18] also points to the need for more research in multilingual processing and retrieval. Text is the natural way of recording the thoughts and feelings of human beings; as a result, text data has become the major component of all digital data. In traditional Information Retrieval (IR) methods [1], the text is tokenized into words and a corresponding Vector Space Model (VSM) [11] is created in the initial phase. IR algorithms such as LSI [19] and QR Factorization [10] are then applied to retrieve the documents that are most relevant to the query given by the user. Since the stemming and stop word elimination steps [20] are purely language dependent, the VSM formed in most IR work is language dependent. Naturally, this type of VSM cannot accommodate documents that are written in a couple of