BOEMIE ontology-based text annotation tool

Pavlina Fragkou, Georgios Petasis, Aris Theodorakos, Vangelis Karkaletsis, Constantine D. Spyropoulos
Software and Knowledge Engineering Laboratory
Institute of Informatics and Telecommunications
National Center for Scientific Research (N.C.S.R.) “Demokritos”
P. Grigoriou and Neapoleos Str., 15310 Aghia Paraskevi Attikis, Greece
E-mail: {fragou, petasis, artheo, vangelis, costass}@iit.demokritos.gr

Abstract

The huge amount of information available on the Web creates a need for effective information extraction systems that are able to produce metadata satisfying users’ information needs. The development of such systems, in the majority of cases, depends on the availability of an appropriately annotated corpus from which extraction models can be learned. The production of such corpora can be significantly facilitated by annotation tools that are able to annotate, according to a defined ontology, not only named entities but, most importantly, relations between them. This paper describes the BOEMIE ontology-based annotation tool, which is able to locate blocks of text that correspond to specific types of named entities, fill tables corresponding to ontology concepts with those named entities, and link the filled tables based on relations defined in the domain ontology. Additionally, it can annotate blocks of text that refer to the same topic. The tool has a user-friendly interface and supports automatic pre-annotation, annotation comparison, and customization to other annotation schemata. The annotation tool has been used in a large-scale annotation task involving 3000 web pages regarding athletics. It has also been used in another annotation task involving 503 web pages with medical information, in different languages.

1. Introduction

Nowadays, the vast amount of information available on the Web remains to a considerable degree an unexploited treasury of knowledge resources.
Among the approaches aiming at effective extraction, analysis and fast search are those that try to exploit the actual content of web pages by producing metadata. Metadata is usually defined as “data about data”, aiming to express the “semantics” of information and thus improving information seeking, retrieval, understanding and use. Metadata can be applied both to a wide range of documents and to applications available on the Web in the form of web services. The most important problems in extracting metadata from documents and applications lie, firstly, in the description of the “semantics” of the desired information and, secondly, in the requirements that information extraction systems must meet in order to extract those metadata. Regarding the first problem, semantics can be defined simply by specifying a vocabulary; however, they are more effectively defined by ontologies. This is because ontologies not only specify the vocabulary of interest through a set of agreed keywords, each corresponding to a defined concept, but also specify the properties of those concepts, the relations between concepts, and formal axioms. Regarding the second problem, information extraction systems must be provided with an appropriately annotated corpus in order to be able to extract metadata. The production of such corpora can be significantly facilitated by annotation tools. Ideally, annotation tools must support the following tasks: (a) annotation of named entities, i.e. of certain types of proper names; (b) annotation of possible relations between the named entities, such as the relation between an employee and a company; (c) annotation of a structure with extracted information involving several relations and events of interest, based on the previous steps, such as annotation of the positions, companies and persons involved in high-level management succession events. In all steps, the annotation is guided by the defined semantics.
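The three annotation tasks listed above can be pictured as progressively richer data structures. The following is a minimal Python sketch intended only as an illustration; it is not the data model of any particular annotation tool. The class names (`Entity`, `Relation`, `Event`), the helper `find_span`, the type labels and the example sentence are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Entity:
    """Task (a): a named entity, i.e. a typed span of the document text."""
    start: int
    end: int
    etype: str

    def text(self, doc: str) -> str:
        return doc[self.start:self.end]


@dataclass(frozen=True)
class Relation:
    """Task (b): a typed link between two annotated entities."""
    rtype: str
    source: Entity
    target: Entity


@dataclass
class Event:
    """Task (c): a structure grouping entities into named slots."""
    etype: str
    slots: Dict[str, Entity] = field(default_factory=dict)


def find_span(doc: str, phrase: str, etype: str) -> Entity:
    """Hypothetical helper: annotate the first occurrence of a phrase."""
    start = doc.index(phrase)
    return Entity(start, start + len(phrase), etype)


# Invented example sentence; all labels below are assumptions.
doc = "Jane Doe was named chief executive of Acme Corp."
person = find_span(doc, "Jane Doe", "Person")
position = find_span(doc, "chief executive", "Position")
company = find_span(doc, "Acme Corp", "Company")

# (b) a relation between an employee and a company
employment = Relation("employee_of", person, company)

# (c) a management succession event built on the previous steps
succession = Event(
    "ManagementSuccession",
    {"person": person, "position": position, "company": company},
)
```

In this sketch each level reuses the annotations of the level below it, which mirrors the statement that task (c) is based on the previous steps. The set of permitted entity types, relation types and event slots would come from the defined semantics, for example an ontology.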
While a considerable number of annotation tools can be found in the literature, the majority of those are restricted to the annotation of named entities. Additionally, they are unable to cope with the annotation of large corpora. In order to overcome those problems and to perform annotation of the text modality for the purposes of the BOEMIE (Bootstrapping Ontology Evolution with Multimedia Information Extraction) project (www.boemie.org), a new text annotation tool was developed by extending an already existing one. This tool supports the annotation of named entities, the so-called Middle Level Concepts (MLCs) of the domain ontology, using the BOEMIE terminology. A part of the BOEMIE ontology concerning the text modality is depicted in Fig. 1. The annotation tool also enables the annotation of relations between those named entities. These relations are grouped in tables of specific types. Tables correspond to High Level Concepts (HLCs) of the domain ontology (according to BOEMIE terminology) and have their fields filled with MLC instances. Furthermore, the tool enables the annotation of relations between HLC instances by creating linkages between tables in an effective and easy way. An additional