Towards an integrated representation of multiple layers of linguistic annotation in multilingual corpora

Elke Teich & Silvia Hansen
Institute for Applied Linguistics, Translation and Interpreting (FR 4.6)
University of Saarland, Germany
E.Teich@mx.uni-saarland.de / S.Hansen@mx.uni-saarland.de

1. Introduction: Annotation of natural language corpora

There has been an increasing interest in recent years in the enrichment of natural language corpora with explicit linguistic annotation. This interest manifests itself most prominently in two areas of linguistics: corpus linguistics and computational linguistics.

For corpus linguistics, the long-standing practice has been to work on raw, i.e., unannotated text. While raw corpora are adequate for some kinds of linguistic work, notably lexicology and lexicography, for other kinds of linguistic analysis, e.g., syntactic or semantic analysis, the information that needs to be extracted is not readily derivable from raw text. Thus, corpora have to be annotated with linguistic categories before the desired kinds of information can be extracted. For such annotation to be practicable at all, the annotation process needs to be carried out automatically or at least semi-automatically.

The automatic processing of large corpora, including linguistic annotation, has been a central issue in computational linguistics in recent years. Here, one of the main interests is in the statistical processing of natural language data (cf. Charniak 1993), such as statistically-based part-of-speech tagging or statistical parsing. The main purpose of these techniques is application in natural language systems, for instance in machine translation (e.g., Brown et al. 1990). These techniques can also be employed for purposes of corpus-based, descriptive linguistics.

In recent years, most of the large corpora of English (BNC, LOB, Bank of English etc.)
have been annotated with part-of-speech information, which has made it possible to exploit them for syntactic analysis as well. Corpora with shallow syntactic annotation (annotation at phrase structure level) also exist, e.g., the Penn Treebank (Marcus et al. 1993). What remains problematic, however, is linguistic annotation at more abstract levels of linguistic organization, notably the semantic and discourse strata. Here, annotation can only be carried out semi-automatically, e.g., with the help of tools that support interactive mark-up of texts by humans.

If a corpus is to be annotated with more than one kind of annotation, we find ourselves in a situation in which the corpus exists in a number of versions, one for each kind of annotation, e.g., a syntactic one and a semantic one. This has serious implications for exploiting the corpus for information extraction: it is impossible to query the corpus with reference to more than one layer of annotation at a time. This problem has been increasingly acknowledged both in corpus linguistics and in computational linguistics. One of the paradigms proposed to overcome such difficulties is that of document encoding, a paradigm that has been increasingly applied in humanities computing, including linguistic applications (e.g., TEI (Sperberg-McQueen & Burnard 1999), XCES (XCES 2000)).

The present paper is concerned with the integration of different kinds of linguistic annotation for multilingual corpora, employing the paradigm of document encoding using the Extensible Mark-up Language (XML). The context in which this is of interest for us is corpus-based translation analysis; more specifically, we are interested in the empirical testing of hypotheses concerning the specific properties of translations when compared to original texts in the same language as the target language and to original texts in the source language.

The paper is organized as follows.
First, we briefly present our analysis scenario (Section 2). Then we discuss the annotation techniques we have employed to enrich our corpora with the desired linguistic information (Section 3). In Section 4 we present a possible solution to the integration of different kinds of corpus annotation. Section 5 concludes the paper with a summary and discussion of issues for future work.
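
To make the querying problem described in this introduction concrete, the following is a minimal sketch of what it means to hold two annotation layers in a single XML document and to query both at once. It is an illustration under stated assumptions, not the encoding scheme proposed in this paper: the element name `tok`, the attributes `pos` and `role`, the tag values and the example sentence are all hypothetical, and Python's standard `xml.etree.ElementTree` module is used purely for demonstration.

```python
import xml.etree.ElementTree as ET

# Hypothetical mini-corpus: one sentence whose tokens carry two layers of
# annotation at once, a part-of-speech tag ("pos") and a semantic role
# ("role"). All element names, attribute names and values are assumptions
# made for this illustration only.
DOC = """
<text lang="en">
  <s id="s1">
    <tok id="t1" pos="DT">The</tok>
    <tok id="t2" pos="NN" role="Actor">committee</tok>
    <tok id="t3" pos="VBD" role="Process">approved</tok>
    <tok id="t4" pos="DT">the</tok>
    <tok id="t5" pos="NN" role="Goal">proposal</tok>
  </s>
</text>
"""


def find_tokens(root, pos, role):
    """Return the text of all tokens matching BOTH annotation layers."""
    return [
        tok.text
        for tok in root.iter("tok")
        if tok.get("pos") == pos and tok.get("role") == role
    ]


root = ET.fromstring(DOC)

# A single query that refers to the syntactic and the semantic layer together.
print(find_tokens(root, pos="NN", role="Goal"))  # prints ['proposal']
```

If the part-of-speech layer and the semantic layer were stored in two separate versions of the corpus, a query such as "all nouns functioning as Goal" would first require aligning the two versions token by token; with both layers attached to the same element it reduces to one filter.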