Unsupervised System for Lexical Disambiguation of Arabic Language Using a Vote Procedure

Abstract— In this paper, we propose an unsupervised method for Arabic word sense disambiguation. Using a corpus and the glosses of the ambiguous word, we define a method that automatically generates a context of use for each sense. We then define a similarity measure, based on collocation measures, to find the context of use closest to the sentence containing the ambiguous word. Since the similarity measure may return more than one sense, we define a novel procedure, called the vote procedure, to choose among them. Our work was compared with other related works and obtained a better average disambiguation rate of 79%.

Keywords—glosses, stemming, string matching, contexts of use, similarity measure, vote procedure.

I. INTRODUCTION

The task of identifying the meaning of a word is called word sense disambiguation (WSD); it is one of the oldest problems in natural language processing (NLP) [1]. The benefits of WSD have been exploited by many NLP applications, such as machine translation, information retrieval, grammatical analysis, speech processing, and text processing [1].

In this work, we are interested in determining the meaning of ambiguous Arabic words. First, we apply a weighting method to identify the stop words in the sentence containing the ambiguous word. After several tests during the experimental study, we established a list of stop words, which are eliminated from the original sentence. Then, we use a dictionary and a corpus as resources for the automatic extraction of the contexts of use generated for each sense of the ambiguous word. We apply stemming [2] to the words contained in the glosses of the ambiguous word, then we use an approximate string-matching algorithm [3] to extract the contexts of use from the corpus. Finally, we define a new similarity measure to find the context of use nearest to the original sentence.
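The preprocessing pipeline above (stop-word elimination, stemming of the gloss words, approximate matching of gloss stems against the corpus) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stop-word list, function names, and the use of `difflib.SequenceMatcher` as the approximate string matcher are all assumptions made for the example.

```python
from difflib import SequenceMatcher

# Hypothetical stop-word list; the paper builds its own list experimentally.
STOP_WORDS = {"في", "من", "على", "إلى", "عن"}

def remove_stop_words(tokens):
    """Drop stop words from the sentence containing the ambiguous word."""
    return [t for t in tokens if t not in STOP_WORDS]

def approx_match(stem, word, threshold=0.8):
    """Approximate string matching: accept a corpus word as an occurrence
    of a gloss stem when their similarity ratio reaches a threshold."""
    return SequenceMatcher(None, stem, word).ratio() >= threshold

def build_context_of_use(gloss_stems, corpus_sentences):
    """Collect the corpus sentences whose words approximately match the
    stems extracted from one gloss of the ambiguous word."""
    context = []
    for sentence in corpus_sentences:
        if any(approx_match(s, w) for s in gloss_stems for w in sentence):
            context.append(sentence)
    return context
```

The approximate matching step is what lets a stem recovered from a gloss match inflected or slightly different surface forms in the corpus; the threshold trades recall against noise in the extracted contexts of use.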
Some predefined collocation measures are used by the defined similarity measure. In the case where the similarity measure gives more than one sense, we define a vote procedure that chooses the correct sense among those proposed.

This paper is structured as follows. Section two describes some related works. Section three gives a detailed account of the proposed method for the disambiguation of Arabic words. Finally, section four presents the results obtained by our work and by the compared works.

II. RELATED WORKS

Most works related to WSD have been applied to English; they are classified according to the source of knowledge used. Some works have also been applied to Arabic; they are described and detailed in what follows.

A. Knowledge-Based Methods

These methods were introduced in the 1970s and are based on dictionaries, thesauri, and lexicons. Some of them, like the Lesk algorithm, test the adequacy of the definitions given by an electronic dictionary. We can cite the work of Guiassa [4], which is based on a dictionary of use: the senses of words are represented as conceptual vectors, and the Lesk algorithm is used to count the intersections between the candidate sense D(Sj) and the words contained in the same context as the ambiguous word. Variants of the Lesk algorithm have also been evaluated for disambiguating Arabic words [5]. In the first experiment, the original Lesk algorithm was applied using a dictionary as a resource. In the second experiment, the Lesk algorithm was modified to use Arabic WordNet and similarity measures that determine the relatedness between two concepts in Arabic WordNet. These variants assign a score to the most relevant sense of the ambiguous word using two different resources.

B. Corpus-Based Methods

With the evolution of statistical methods based on large text corpora, two principal families of methods have appeared.

1) Unsupervised methods. These methods use a non-annotated corpus.
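The Lesk-style overlap counting described above can be sketched in a few lines. This is a toy illustration with hypothetical names and English glosses, not the cited implementations: it counts intersections between each candidate sense's gloss and the words around the ambiguous word, and returns every top-scoring sense, which is exactly the tie situation that a vote procedure must then resolve.

```python
def lesk_score(sense_gloss, context_words):
    """Count intersections between a candidate sense's gloss and the
    words surrounding the ambiguous word (the core of the Lesk algorithm)."""
    return len(set(sense_gloss) & set(context_words))

def disambiguate(glosses, context_words):
    """Return the sense(s) with the highest overlap score. Several senses
    may tie; resolving that tie is the role of a vote procedure."""
    scores = {sense: lesk_score(g, context_words) for sense, g in glosses.items()}
    best = max(scores.values())
    return [s for s, v in scores.items() if v == best]
```

For example, with glosses `{"s1": ["bank", "money", "deposit"], "s2": ["bank", "river", "water"]}`, the context `["deposit", "money", "account"]` selects only `s1`, while the single-word context `["bank"]` leaves both senses tied.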
The contexts are represented by high-dimensional spaces defined by word co-occurrences. We can cite the combination of information retrieval methods with the Lesk algorithm for Arabic WSD [6], where three information retrieval measures and latent semantic analysis were applied to measure the similarity between the context of use of each sense of the ambiguous word and the original sentence. These measures were combined with the Lesk algorithm to achieve an accuracy rate of 73%.

Merhbene Laroussi, Faculty of Sciences of Monastir, LATICE Laboratory, University of Monastir, Tunisia (Aroussi_merhben@hotmail.com)
Anis Zouaghi, ISSAT Sousse, LATICE Laboratory, University of Sousse, Tunisia (Anis.zouaghi@gmail.com)
Mounir Zrigui, Faculty of Sciences of Monastir, LATICE Laboratory, University of Monastir, Tunisia (Mounir.Zrigui@fsm.rnu.tn)