Medical Term Extraction in an Arabic Medical Corpus

Doaa Samy*, Antonio Moreno-Sandoval*, Conchi Bueno-Díaz*, Marta Garrote-Salazar* and José M. Guirao

Cairo University, Faculty of Arts, Egypt
*Computational Linguistics Laboratory-Autónoma University Madrid
Granada University

doaasamy@cu.edu.eg, antonio.msandoval@uam.es, diazmunio@hotmail.com, marta.garrote@uam.es, jmguirao@ugr.es

Abstract

This paper tests two different strategies for medical term extraction in an Arabic medical corpus. The experiments and the corpus were developed within the framework of the Multimedica project, which is funded by the Spanish Ministry of Science and Innovation and aims at developing multilingual resources and tools for processing newswire texts in the health domain. The first experiment uses a fixed list of medical terms; the second uses a list of Arabic equivalents of a very limited set of common Latin prefixes and suffixes used in medical terms. Results show that using the equivalents of Latin prefixes and suffixes outperforms the fixed list. The paper starts with an introduction, followed by a description of the state of the art in the field of Arabic medical Language Resources (LRs). The third section describes the corpus and its characteristics. The fourth and fifth sections explain the lists used and the results of the experiments carried out on a sub-corpus for evaluation. The last section analyzes the results, outlining the conclusions and future work.

Keywords: Arabic Medical Language Resources, Arabic Medical Terms, Term Extraction.

1. Introduction

This paper presents an experiment carried out within the Multimedica project. The goal of the experiment is to test two different strategies for medical term extraction in an Arabic corpus: the first is based on a list of specific medical terms in Arabic in their full form; the second is based on a list of Arabic equivalents of Latin prefixes and suffixes commonly used in the medical and health domain.
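The contrast between the two strategies can be made concrete with a minimal sketch. It is not the implementation used in the experiments: the two toy lists, the function names, the whitespace tokenization and the one-token window are all assumptions made for illustration, and the Arabic strings are illustrative renderings of "conjunctivitis" and of the equivalent of the suffix "-itis".

```python
# Hypothetical toy lists; the actual lists used in the experiments are not reproduced here.
FULL_TERMS = {"التهاب الملتحمة"}   # assumed full-form entry: "conjunctivitis"
AFFIX_EQUIVALENTS = {"التهاب"}      # assumed equivalent of the Latin suffix "-itis"


def match_full_terms(text, terms):
    """Strategy 1: return the listed full-form terms that occur verbatim in the text."""
    return sorted(t for t in terms if t in text)


def match_affix_equivalents(text, equivalents, window=1):
    """Strategy 2: return each affix equivalent together with the next `window`
    tokens as a candidate compound term (naive whitespace tokenization)."""
    tokens = text.split()
    candidates = []
    for i, token in enumerate(tokens):
        # Only propose a candidate when at least one token follows the equivalent.
        if token in equivalents and i + 1 < len(tokens):
            candidates.append(" ".join(tokens[i:i + 1 + window]))
    return candidates


# "The patient suffers from hepatitis": not in the fixed list,
# but found through the "-itis" equivalent.
sentence = "يعاني المريض من التهاب الكبد"
fixed_hits = match_full_terms(sentence, FULL_TERMS)
affix_hits = match_affix_equivalents(sentence, AFFIX_EQUIVALENTS)
```

In this toy case the fixed list finds nothing, while the affix equivalent proposes a two-token candidate for a term that was never listed. The sketch ignores Arabic clitics and inflection: a token with an attached preposition would not match without further morphological processing.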
Arabic equivalents are words that can form part of compound terms. For example, the first list includes terms in their complete form, such as “conjunctivitis” and its Arabic translation “التهاب الملتحمة”. The second list includes only the Latin suffix “-itis” and its Arabic equivalent, which in this case is “التهاب”. As a test dataset, an Arabic medical corpus has been built from the health sections of Arabic newswire texts and from health portals. Thus, the experiments carried out and described in this paper offer the community new resources in the Arabic medical and health domain (a corpus and a terminological database).

Multimedica is a project funded by the Spanish Ministry of Science and Innovation. The project aims at developing multilingual resources and tools for processing newswire texts in the health domain. The languages covered in the project are Spanish, Arabic and Japanese. The resources and tools developed will be included in a translation and terminology portal targeting students and professors at Spanish universities. This portal will include a term extractor applied to comparable corpora in Spanish, Arabic and Japanese.

In this paper, we outline the methodology applied to the Arabic language. The rest of the paper is divided into four parts: a review of the state of the art in Arabic medical Language Resources (LRs), the building of the corpus, the terminological lists and, finally, the experiments and results.

2. State-of-the-Art in Arabic Medical LR

The state of the art in the Arabic medical and health domain presents some challenges for language resources and tools. These challenges are due to certain practices adopted by practitioners and specialists within the medical and health domain in many countries across the Arab World. The main challenge in addressing Arabic Language Resources (LRs) in health sciences is the clear diglossia prominent among specialists and practitioners in the field.
Diglossia, as defined by Charles Ferguson (1959), is a sociolinguistic phenomenon in which two languages or two dialects are used by the same community in different social situations and for different social purposes. Diglossia can be observed in the following aspects:

- First, Arabic is not the language used in teaching Medicine, Pharmacy and other health-related programmes at the university level in many Arab countries. Instead, English or French is used as a lingua franca. In Morocco, Tunisia and Algeria, French is used, while in Egypt, Iraq, Jordan, Saudi Arabia and the Gulf countries, English is used. Syria is the only exception, where Arabic is used in teaching and health practices.
- Second, English or French are the languages used in professional practices within the health domain in Arab