From Machine Readable Dictionaries to Lexical Databases: the CONCEDE Experience

TOMAŽ ERJAVEC (1), ROGER EVANS (2), NANCY IDE (3), ADAM KILGARRIFF (2)

(1) Dept. of Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia
(2) ITRI, University of Brighton, Brighton, U.K.
(3) Dept. of Computer Science, Vassar College, Poughkeepsie, USA

Abstract

It is commonly held that machine-readable dictionaries play a key role in bootstrapping effective wide-coverage language technology, especially in less well-resourced languages. However, while the linguistic knowledge they contain is clearly necessary for this goal, it is far from clear that the format in which it is presented is sufficient to reach it. A crucial step in the deployment of such resources is to map them into lexical databases with standardised and well-understood structure and semantics. Furthermore, considerable additional benefits are obtained if such structure and semantics are shared with other linguistic resources. Achieving such a goal, however, is often not an easy task. This paper describes how such a mapping was carried out in the CONCEDE project, for six Central and Eastern European languages (Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene) for which few wide-coverage lexical resources had previously been available. In a two-stage process, the machine-readable data for each language was first mapped into broadly compatible, TEI-compliant SGML representations, and then these representations were harmonised into a single XML scheme. The resulting framework offers a concise, flexible lexical database specification, with a demonstrable ability to cope with a diverse range of dictionary and language requirements, and lexical resources suitable for monolingual and multilingual application.

1. Introduction

The value of language resources is greatly enhanced if they share a common markup with explicit minimal semantics.
Achieving this goal for lexical databases (LDBs) is difficult, as large-scale resources can realistically be obtained only by up-translation from pre-existing dictionaries, each with its own proprietary structure. Furthermore, proprietary dictionary data sets developed primarily to support the production of printed dictionaries (for example, for typesetting) are notoriously difficult to formalise, owing to the lack of a formal specification of the data, failure to conform to a specification when one is provided, or plain errors (of content, structure, or typography).

¹ Consortium for Central European Dictionary Encoding – INCO-COPERNICUS project no. PL96-1152. The support of the European Commission for this research is gratefully acknowledged.

The EU project CONCEDE¹ constructed lexical databases from existing machine-readable dictionaries for