Toward a Standard Lexical Resource in the Bio Domain

Valeria Quochi, Riccardo Del Gratta, Eva Sassolini, Monica Monachini, Nicoletta Calzolari
Istituto di Linguistica Computazionale
Area della Ricerca CNR, Pisa, Italy
{firstname.name}@ilc.cnr.it

Abstract

The present paper describes the model and database structure of a large-scale lexical resource for the biology domain designed for both human and machine use. Our lexicon aims at semantic interoperability and extendability. This is achieved through the adoption of the upcoming ISO standard for lexical representation and through a granular and distributed encoding of relevant information. The first part of this contribution focuses on two aspects of the model that are of particular interest to the Biology community: the treatment of term variants and the alignment with ontological information. The second part of the paper describes the physical implementation of the model: a relational database equipped with a set of automatic uploading procedures. The population relies on an XML input data structure that organizes terms together with their related properties and allows them to be automatically “pulled and pushed” into the database. A peculiarity of the BioLexicon is that it combines features of both terminologies and lexicons: it is able to represent terms and their variants, with some of the semantic relations that link them, as well as the morphological, syntactic and lexical semantic properties of terms as lexical items.

1. Introduction

Due to the increasing production of literature in the biomedical field, intensive research is being carried out around the globe to develop language technologies to access this large body of literature and to extract knowledge from it, in order to make it easily available to researchers, students and other domain users (e.g. protein and gene databases like Uniprot and EntrezGene).
Most of the resources available, however, are created mainly for human use, which often makes them not particularly useful for text mining and information extraction applications. Recently, efforts have been directed to the creation of large-scale terminological resources that merge information contained in various smaller resources: large thesauri based on a normalized nomenclature (Kors et al., 2005), and extensible lexical and terminological databases like Termino (Harkema et al., 2004) and the SPECIALIST Lexicon (NLM, 2007). Access to and interoperability of biological databases, however, are still hampered by a lack of uniformity and harmonization in both the formats and the information encoded. The current challenge in bioinformatics is to construct a comprehensive and incremental resource which integrates bio-terms encoded in different existing databases and which encodes all relevant properties according to the most accredited standards for the representation of lexical, terminological and conceptual information (Hahn and Markó, 2001). Our paper describes both the conceptual and the physical model of a large-scale lexical and terminological resource for biology (the BioLexicon) that is currently under development within the European BOOTStrep project [1].

[1] BOOTStrep (Bootstrapping Of Ontologies and Terminologies STrategic Project) is a Specific Targeted Research Project of the European Union's 6th Framework Programme within IST call 4. Six partners from four European countries (Germany, U.K., Italy, France) and one Asian partner from Singapore are involved in the project. www.bootstrep.eu

The resource we describe learns from the state-of-the-art resources, especially from the SPECIALIST Lexicon and Termino, and builds on our experience in the standardization and construction of lexical resources.
The goal is to propose a standard for the representation of lexicons in the Bio domain, which could eventually also be interoperable with other domain lexicons.

2. Related Works

In this section we briefly review two advanced state-of-the-art lexicons in the bio-medical domain: the UMLS SPECIALIST Lexicon and Termino.

The UMLS SPECIALIST Lexicon has been developed by the NLM as a wide-coverage lexical basis for UMLS NLP tools (NLM, 2007). It encodes words gathered from texts, and for each word a set of morpho-syntactic and lexical semantic information is specified (for example part-of-speech, complement pattern, etc.). The format, however, is designed to be optimal for use by specific applications, so it may be neither easily reusable nor interoperable with other resources and/or tools.

Termino, in turn, has a more flexible structure: each type of information (e.g. POS, external sources, and others) is encoded in separate tables, so that the information can be combined in different ways according to specific user needs (Harkema et al., 2004). The model, however, does not seem to conform explicitly to any established lexical standard.

The design of the BioLexicon structure, presented in Section 3 below, aims at merging the two main features of those two resources: the richness of the linguistic information encoded, and modularity and flexibility. The major novelty of our proposal is the adherence to the upcoming ISO standards for lexicon representation (see below) together with a detailed model suited to represent sophisticated syntactic and semantic information. Additionally, the BioLexicon encodes part of the conceptual information typically contained in ontologies or thesauri, i.e. semantic relations such as "is a" and "is part of".
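
To make the idea of a modular encoding more concrete, the following is a minimal sketch in Python (using the standard sqlite3 module) of a lexicon in which each type of information lives in its own table and is combined only at query time. It is not the actual schema of Termino or of the BioLexicon: the table names, column names, the relation label `is_a`, and the toy entries are all assumptions introduced purely for illustration.

```python
import sqlite3


def build_demo_lexicon():
    """Create an in-memory database with one table per information type.

    All table names, column names and rows are hypothetical toy data.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE entry (id INTEGER PRIMARY KEY, lemma TEXT NOT NULL);
        CREATE TABLE pos (entry_id INTEGER REFERENCES entry(id), tag TEXT);
        CREATE TABLE variant (entry_id INTEGER REFERENCES entry(id), form TEXT);
        CREATE TABLE relation (
            source_id INTEGER REFERENCES entry(id),
            rel_type  TEXT,
            target_id INTEGER REFERENCES entry(id)
        );
    """)
    conn.executemany("INSERT INTO entry VALUES (?, ?)",
                     [(1, "protein"), (2, "kinase")])
    conn.executemany("INSERT INTO pos VALUES (?, ?)",
                     [(1, "noun"), (2, "noun")])
    conn.executemany("INSERT INTO variant VALUES (?, ?)",
                     [(2, "kinases")])
    # A semantic relation of the kind usually found in ontologies or thesauri.
    conn.execute("INSERT INTO relation VALUES (2, 'is_a', 1)")
    conn.commit()
    return conn


def describe(conn, lemma):
    """Combine the separate tables into a single view of one entry."""
    row = conn.execute(
        "SELECT id FROM entry WHERE lemma = ?", (lemma,)).fetchone()
    if row is None:
        return None
    entry_id = row[0]
    tags = [r[0] for r in conn.execute(
        "SELECT tag FROM pos WHERE entry_id = ? ORDER BY tag", (entry_id,))]
    forms = [r[0] for r in conn.execute(
        "SELECT form FROM variant WHERE entry_id = ? ORDER BY form",
        (entry_id,))]
    relations = list(conn.execute(
        "SELECT r.rel_type, e.lemma FROM relation r "
        "JOIN entry e ON e.id = r.target_id "
        "WHERE r.source_id = ? ORDER BY r.rel_type, e.lemma", (entry_id,)))
    return {"lemma": lemma, "pos": tags,
            "variants": forms, "relations": relations}


if __name__ == "__main__":
    print(describe(build_demo_lexicon(), "kinase"))
```

Because each information type sits in its own table, a client that only needs variants, or only needs semantic relations, can query that table alone, while a richer view such as the one built by `describe` joins them on demand.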