Using the Ellogon Natural Language Engineering Infrastructure

Georgios Petasis, Vangelis Karkaletsis, Georgios Paliouras, and Constantine D. Spyropoulos

Software and Knowledge Engineering Laboratory, Institute of Informatics and Telecommunications, National Centre for Scientific Research (N.C.S.R.) “Demokritos”, P.O. BOX 60228, Aghia Paraskevi, GR-153 10, Athens, Greece.
e-mail: {petasis, vangelis, paliourg, costass}@iit.demokritos.gr

Abstract

Ellogon is a multi-lingual, cross-operating system, general-purpose natural language engineering infrastructure. Ellogon has been used extensively in various NLP applications. It is currently provided free of charge, for research use, to research and academic organisations. In this paper, we outline its architecture and data model, present Ellogon features as used by different types of users, and discuss its functionality in relation to other infrastructures for language engineering.

1. Introduction

Ellogon was developed by the Software and Knowledge Engineering Laboratory (SKEL) of the Institute of Informatics and Telecommunications, N.C.S.R. “Demokritos”, Greece¹. Its development started in 1998. The motivation at that time was that existing platforms did not adequately support features important to the NLP applications of SKEL. Having been in constant development since 1998, Ellogon is now a mature and well-tested infrastructure that has been used in many national and European projects, either by SKEL or by its project partners. Ellogon facilities have been used for creating a wide range of NLP applications, such as various linguistic tools, information filtering and extraction systems in several European languages, and controlled language checkers. Since 2002, Ellogon has been provided free of charge, for research use, to research and academic organisations.
As a text engineering platform, Ellogon offers an extensive set of facilities, including tools for processing and visualising textual/HTML/XML data and associated linguistic information, support for lexical resources (such as creating and embedding lexicons), and tools for creating annotated corpora, accessing databases, comparing annotated data, and transforming linguistic information into vectors for use with various machine learning algorithms.

The rest of the paper is organised as follows. Section 2 briefly describes the Ellogon architecture, its data model and some important features. Section 3 presents some features that facilitate its use by different types of users, whereas section 4 gives examples of NLP applications in which Ellogon was employed. Section 5 discusses Ellogon functionality in relation to existing infrastructures. Finally, section 6 presents some concluding remarks and future plans.

¹ http://www.iit.demokritos.gr/skel/Ellogon

2. Ellogon Architecture and Data Model

During the last decade, a large number of software infrastructures aiming to facilitate R&D in the field of natural language processing have been presented. Some of these infrastructures, such as the LT-NSL/LT-XML tools (McKelvie et al., 1997) or GATE², have become extremely popular, as they have been applied to a wide range of tasks by many institutions around the world.

Ellogon belongs to the category of referential, or annotation-based, platforms, in which the linguistic information is stored separately from the textual data and holds references back to the original text. Based on the TIPSTER data model (Grishman, 1997), Ellogon provides infrastructure for:
• Managing, storing and exchanging textual data as well as the associated linguistic information.
• Creating, embedding and managing linguistic processing components.
• Facilitating communication among different linguistic components by defining a suitable programming interface (API).
• Visualising textual data and associated linguistic information.

The architecture of Ellogon, the utilised data model and the linguistic processing components, as well as some key features of Ellogon, are presented in the following sub-sections.

² Information about GATE can be found at http://gate.ac.uk

2.1. Ellogon Architecture

Ellogon can be used either as an NLP integrated development environment (IDE) or as a library that can