Recent Improvements to the ATLAS Architecture

Christophe Laprun, Jonathan Fiscus, John Garofolo, Sylvain Pajot
National Institute of Standards and Technology
100 Bureau Drive, Mail Stop 8940
Gaithersburg, MD 20899-8940
(+1) 301 975 3191
{claprun, jfiscus, jgarofolo, pajot}@nist.gov

ABSTRACT

We examine recent improvements to the ATLAS (Architecture and Tools for Linguistic Analysis Systems) architecture. We first introduce the architecture and the historical context for this work. Next, we describe and analyze NIST’s initial implementation of the framework. We then focus on three important improvements (relating to multi-dimensional signals, hierarchical structures, and validation) that we have made to the architecture to make it more usable. We conclude by summarizing the major points covered and discussing plans for future work.

Keywords

ATLAS, MAIA, Linguistic infrastructure

1. INTRODUCTION

Annotated corpora are a central component of research in human language technology. As corpora have proliferated across languages, disciplines, and technologies, the lack of common exchange and storage formats has become a critical problem. This profusion of formats has made it significantly more difficult to reuse annotated data or to adapt existing tools to new annotation tasks. The standardization of tag sets (an approach we tried with our Universal Transcription Format [5]) is only moderately useful, since language research is by necessity an open-ended task, subject to constant revision as research domains change and theories evolve.

A solution to this "bazaar of tools and formats" [2] is to interpose a generic annotation model through which annotation data is manipulated. ATLAS (Architecture and Tools for Linguistic Analysis Systems) makes use of such a generic data model.

We first examine the historical context that led to the creation of the project.
We then briefly describe the first implementation of the architecture, singling out three aspects that needed to be improved: the handling of complex signals, the handling of hierarchical structures, and validation. Each of these aspects is then discussed in detail in subsequent sections. We conclude by summing up the major points we covered and suggesting future work.

2. HISTORICAL CONTEXT

The ATLAS project started in 1999 as a collaboration between the LDC, MITRE, and NIST. It followed Bird and Liberman’s seminal work on Annotation Graphs (AGs) [1], which demonstrated commonality across a diverse range of annotation practices and defined a formalism based on labeled, directed acyclic graphs. The three parties recognized the urgent need for a consistent way to represent and process annotation data. NIST needed such a framework to accommodate constantly evolving needs in linguistic evaluation. The LDC was developing the AG formalism to build an infrastructure that would help reduce the cost of linguistic annotation, while MITRE was interested in extending its Alembic Workbench annotation tool to support new domains.

NIST recognized the importance of the LDC’s work on AGs and decided to form a working group to explore the creation of a generic annotation framework and toolset that would address three important issues for the linguistic research community. First, ATLAS would promote the reuse and exchange of language corpora. By providing a generic annotation framework, ATLAS would make it easier to share data, since data annotated with a generic representation could be reused in new contexts. Second,