International Journal on Information and Communication Technologies, Vol. 1, No. 1, June 2010

An evaluated semantic QE and structure-based approach for enhancing Arabic Q/A

Lahsen ABOUENOUR, Karim Bouzouba, and Paolo Rosso

Abstract—This paper describes an approach for improving the re-ranking of passages in the Passage Retrieval (PR) module of Arabic Question/Answering (Q/A) systems. The approach combines a semantic Query Expansion (QE) based on the Arabic WordNet (AWN) ontology with a structure-based PR based on the Distance Density n-gram model. Experiments with a set of translated CLEF and TREC questions have shown that the accuracy, the Mean Reciprocal Rank and the number of answered questions are significantly improved by our approach. The performance achieved is analyzed in this paper.

Keywords—Question/Answering; Semantic Query Expansion; Arabic WordNet; Distance Density n-gram model; JIRS

Manuscript received October 30, 2009. This work was supported in part by the XXX research project.
L. ABOUENOUR is with the Mohammadia School of Engineers, Agdal, Rabat, Morocco. phone: (+212) 664 06 35 01; e-mail: abouenour@yahoo.fr.
K. BOUZOUBA is with the Mohammadia School of Engineers, Agdal, Rabat, Morocco. e-mail: karim.bouzouba@emi.ac.ma.
P. ROSSO is with the Natural Language Engineering Lab., Dpto. Sistemas Informáticos y Computación, Universidad Politécnica Valencia, Spain. e-mail: prosso@dsic.upv.es.

I. INTRODUCTION

QUESTION/ANSWERING (Q/A) systems belong to the category of advanced Information Retrieval (IR) tools. They differ from the widely used Search Engines (SE) in that a precise answer is returned to the user rather than a list of links. Indeed, the use of SE places a burden on users, who have to manually filter a long list of returned documents.

Research in the field of Q/A has made significant progress for languages such as English, Spanish, French and Italian [25]. For the Arabic language, there have been few attempts at building Q/A systems. This may be due to the particularities of the language (short vowels, absence of capital letters, complex morphology, etc.). The best-known Arabic Q/A systems are the following.

QARAB [7] is a system that takes natural language questions expressed in Arabic and attempts to provide short answers. The system's primary source of knowledge is a collection of Arabic newspaper text extracted from Al-Raya¹, a newspaper published in Qatar. QARAB uses shallow language understanding to process questions and does not attempt to understand the content of the question at a deep, semantic level.

AQAS [10] is knowledge-based and therefore extracts answers only from structured data and not from raw text (unstructured text written in natural language).

ArabiQA [21] is an Arabic Q/A prototype based on the Java Information Retrieval System (JIRS)² [22] Passage Retrieval (PR) system and a Named Entities Recognition (NER) module. It embeds an Answer Extraction module dedicated especially to factoid questions. In order to implement this module, the authors developed an Arabic NER system [20] and a set of manually built patterns for each type of question.

QASAL [26] is a recent attempt at building an Arabic Q/A system that processes factoid questions (e.g., questions that have named entity (NE) answers). Experiments were conducted and showed that, on test data of 50 questions, the system reached 67.65% precision, 91% recall and 72.85% F-measure.

AQAS and QARAB gave the community of researchers in Arabic Natural Language Processing (NLP) the first prototypes of Arabic Q/A systems. However, since these systems process only structured data, they cannot be used in an open domain such as the web. ArabiQA and QASAL are more advanced, especially in the processing of factoid questions. The former integrates a NER system that has been evaluated and tested using well-known test data. The latter has also been tested, but both tests used a low number of questions. Neither system has been evaluated on an open-domain collection such as the web.

¹ http://www.raya.com
² http://sourceforge.net/projects/jirs

On the other hand, regardless of the language processed, a typical Q/A system has the architecture illustrated in Figure 1 below. Such systems include three modules:

(i) Question analysis and classification module: this module contains the first processes applied to a question. A question has to be analyzed in order to extract its keywords (excluding stopwords), identify the class of the question (for instance factoid, definition, etc.), identify the structure of the expected answer, form the query to be passed to the PR module, etc.