ALTIC-2011, Alexandria, Egypt

USING WIKIPEDIA FOR RETRIEVING ARABIC DOCUMENTS

Mohamed I. Eldesouki*, Waleed Arafa*, Kareem Darwish**, Mervat Gheith*
*Institute of Statistical Studies and Research, Computer Science Department, Cairo University, 5 Dr. Ahmed Zwel Street, Orman, Giza, Egypt
**Microsoft, Microsoft Innovation Center, Smart Village, Building B115, Kilo 28, Cairo/Alex. Desert Road, Abou Rawash, Egypt
disooqi@ieee.org, waleed_arafa@hotmail.com, kareem@darwish.org, mervat_gheith@yahoo.com

Keywords: Arabic Information Retrieval, Text Processing, Wikipedia, Word Sense Disambiguation

Abstract: Although stemming techniques outperform other text processing techniques, they miss many cases that need to be conflated into one class. For instance, synonyms that belong to different roots cannot be conflated into the same class by stemming. In this work, we investigate a new concept-based technique for retrieving Arabic documents that uses the Arabic Wikipedia project to overcome these problems. Word sense disambiguation is applied to terms that have multiple senses. The new technique has been evaluated with different word sense disambiguation techniques. It has also been examined with different versions of the Arabic Wikipedia dumps to show that performance increases progressively as Wikipedia develops. Compared with the results of the experiments that use stemming techniques in (Disooqi and Arafa, 2009), stemming still performs better, but the continuous growth of Wikipedia improves the performance. Results show that information retrieval performance improves as Wikipedia develops and grows.

1. INTRODUCTION

There are many cases in which two words are not quite the same but a match between them is desirable. Conversely, two words may be identical although a match between them is not desirable.
There are many reasons for such problems; some are related to the characteristics of the language itself, and others depend on how the query and the documents are understood. In Arabic, one reason for the first problem is the morphological system used to derive the various forms of words. Although Arabic morphology can give different morphological forms different meanings, a match between these forms is sometimes desirable, for example between a word and its plural form. A second reason is the affix system of the language: articles in Arabic attach to the beginning of nouns, which prevents them from matching the same nouns without articles, and some conjunctions, prepositions, and pronouns are attached to words as affixes. A third reason arises from writing habits: some people omit the HAMZA on the letter ALEF, others use diacritics, and so on. A fourth reason is the existence of multiple synonyms for a word. Fifth, the desired match may be between a word and a phrase, or between two phrases, which a bag-of-words representation cannot capture. The second problem, the unwanted matching of two words that have the same spelling, occurs because the two words have different meanings, a phenomenon called “polysemy”.
Different techniques have been developed to overcome these matching difficulties, including normalization, stemming, morphological analysis, n-grams for words, and the use of ontologies. The following section, Previous Work, discusses some of them.
Normalization addresses the problem of writing habits. It removes diacritics so that words without diacritics match words that have them, normalizes the use of HAMZA and TAA MARBOUTA in words, and can also remove Kashida. Normalization is usually used in conjunction with the aforementioned techniques.
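The normalization steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `normalize_arabic`, the exact set of code points treated as diacritics, and the choice to map TAA MARBOUTA to HAA are assumptions made only for illustration.

```python
import re

# Assumed diacritic set: fathatan (U+064B) through sukun (U+0652),
# plus the superscript alef (U+0670).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
KASHIDA = "\u0640"  # tatweel, the elongation character
# ALEF with madda, with HAMZA above, and with HAMZA below.
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")


def normalize_arabic(text: str) -> str:
    """Apply a simple orthographic normalization to Arabic text."""
    text = DIACRITICS.sub("", text)          # remove diacritics
    text = text.replace(KASHIDA, "")         # remove Kashida
    text = ALEF_VARIANTS.sub("\u0627", text)  # unify ALEF forms to bare ALEF
    text = text.replace("\u0629", "\u0647")  # assumption: TAA MARBOUTA -> HAA
    return text
```

With this sketch, a word written with HAMZA and diacritics and the same word written without them normalize to the same string, so they match at indexing and query time.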
It is performed at the beginning of the information retrieval process after