Towards a Unified Medical Lexicon for French

Pierre Zweigenbaum, Robert Baud, Anita Burgun, Fiammetta Namer, Éric Jarrousse, Natalia Grabar, Patrick Ruch, Franck Le Duff, Benoît Thirion, Stéfan Darmoni

STIM/DSI, Assistance Publique – Hôpitaux de Paris, France
DIM, Hôpitaux Universitaires de Genève, Switzerland
LIM, Centre Hospitalier Régional Universitaire de Rennes, France
ATILF, Université Nancy 2, France
VIDAL, Paris, France
L@STICS, Centre Hospitalier Universitaire de Rouen, France

Abstract

Medical Informatics has a constant need for basic Medical Language Processing tasks, e.g., for coding into controlled vocabularies, free-text indexing and information retrieval. Most of these tasks involve term matching and rely on lexical resources: lists of words with attached information, including inflected forms and derived words. Such resources are publicly available for English through the UMLS Specialist Lexicon, but not for other languages. For French, several teams have worked on the subject and built local lexical resources. The goal of the present work is to pool and unify these resources and to extend them substantially by exploiting medical terminologies and corpora, resulting in a unified medical lexicon for French (UMLF). This paper presents the issues raised by this objective, describes the methods on which the project relies and illustrates them with experimental results.

Keywords: Natural Language Processing; Language; France; Controlled Vocabulary; Algorithms; Funding, Non-US Government

1 Introduction

Basic natural language resources such as those in the UMLS Specialist Lexicon [1] are a key asset for Medical Informatics. Lists of words with attached morphosyntactic information (e.g., “stenoses”, noun, plural) can be useful for extracting terms from medical texts [2], where accurate syntactic tagging is instrumental to successful text analysis.
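To make the notion of a word list with attached morphosyntactic information concrete, the following is a minimal sketch of such a resource and of a lookup over it. It is an illustration only: the `LexEntry` record, its field names, the `lookup` and `lemmatize` helpers and the four toy entries are all assumptions made for this example, and do not reflect the format of the UMLS Specialist Lexicon or of the UMLF resources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexEntry:
    """One hypothetical lexicon record (field names are illustrative)."""
    form: str      # word form as it appears in text
    lemma: str     # base word the form relates to
    category: str  # part of speech
    number: str    # grammatical number


# Toy entries invented for this sketch; a real lexicon holds many thousands.
ENTRIES = [
    LexEntry("stenosis", "stenosis", "noun", "singular"),
    LexEntry("stenoses", "stenosis", "noun", "plural"),
    LexEntry("artery", "artery", "noun", "singular"),
    LexEntry("arteries", "artery", "noun", "plural"),
]

# Index the entries by lowercased word form for direct lookup.
LEXICON = {entry.form.lower(): entry for entry in ENTRIES}


def lookup(word):
    """Return the entry for a word form, or None if the form is unknown."""
    return LEXICON.get(word.lower())


def lemmatize(term):
    """Replace each known word of a term by its lemma; keep unknown words."""
    words = term.lower().split()
    return tuple(LEXICON[w].lemma if w in LEXICON else w for w in words)
```

With such a table, `lookup("stenoses")` yields the category and number that a syntactic tagger could use, and two terms can be compared through `lemmatize` so that singular and plural variants are treated as the same term.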
Relating inflected and derived forms to their base words adds power and flexibility to term matching, e.g., when mapping text to the UMLS with MetaMap [2]. It also enhances information retrieval, especially for inflected languages such as French: for instance, when mapping into the French MeSH in CISMeF [3,4], it allows ‘semantic’ navigation rather than a restrictive hierarchical navigation. More generally, lexical knowledge facilitates access to knowledge bases, whether they are indexed with controlled vocabularies (e.g., the VIDAL drug knowledge base for hospital intranets, www.vidalcim.net) or not (e.g., the ADM knowledge base on diseases [5]). It is also an asset for coding diagnoses into WHO’s ICD-10 or ICF classifications. Such lexical knowledge is available for medical English in the UMLS Specialist Lexicon [1] and for general English (as well as Dutch and German) in the CELEX base [6]. A medical