Eurospeech 2001 - Scandinavia

Concordancing for Parallel Spoken Language Corpora

Dafydd Gibbon, Thorsten Trippel
Universität Bielefeld
gibbon@spectrum.uni-bielefeld.de
ttrippel@spectrum.uni-bielefeld.de

Serge Sharoff
Humboldt Fellow, Universität Bielefeld
Russian Res. Inst. for AI
sharof@aha.ru

Abstract

Concordancing is one of the oldest corpus analysis tools, especially for written corpora. In NLP, concordancing appears in the training of speech recognition systems. Additionally, comparative studies of different languages result in parallel corpora; concordancing for such corpora in an NLP context is a new approach. We propose to combine these fields of interest in a multi-purpose concordance for spoken language data, opening up the opportunity of combining corpus-linguistic and NLP methods and resulting in a broader empirical basis for NLP research. Theoretical models for audio concordances are discussed. Principles of the structure and design of a parallel audio concordance are given: the concordance is coded in XML to ensure reusability and flexibility, using time stamps for referencing from annotations to the signal.

1. Introduction

One of the most commonly used corpus analysis tools, and certainly the oldest (the technique dates back at least to the Middle Ages; the oldest reference to ‘concordaunce’ in this sense given by the Oxford English Dictionary is from 1387), is the text concordance, traditionally defined as a table in which words which occur in a text are paired with citations of the text passages in which they occur. The art of concordancing has reached a peak in computational corpus linguistics, with access criteria which include not only word keys but also linear or hierarchical tagging, and the choice of static (pre-compiled) concordances or dynamic (on-the-fly, free-key) concordances [1].

Users of speech corpora are not so fortunate, despite the fact that the field of audio indexing is developing rapidly, and despite concordance-like techniques for using annotated speech signals to access spoken language corpora in order to train stochastic models for automatic speech recognition. Most tagged spoken language corpora and treebanks still restrict themselves to transcriptions, i.e. textual representations, and do not in general provide access to the speech signal.

In this paper, we examine and define the notion of audio concordance, discuss an implementation, and suggest that a standardised approach to audio concordancing for unilingual and multilingual corpora would provide valuable heuristic support tools for a wide range of linguistic and information-retrieval activities. In particular, we address the question of concordances for parallel aligned speech corpora.

We use the German VERBMOBIL speech-to-speech translation corpus, a corpus with German, English and Japanese data, including both monolingual and multilingual dialogues. For testing we selected the dialogue M872B of [2], a bilingual dialogue in English and German, 435 seconds (approx. 7 min.) long. The speech signals are transcribed (in VERBMOBIL terminology ‘transliterated’) and annotated following the VERBMOBIL conventions, which were converted to an XML format following [3]. For reusability in different contexts, XML-based formats are used for the concordance, based on the formats specified in [4], extended for multilingual and parallel texts.

2. Characterisations and definitions

2.1. Audio concordance

We start with some straightforward and fairly evident characterisations and move on to complex corpora and correspondingly more complex notions of concordance.

First, extending the traditional definition, we provide an initial (and partial) characterisation of audio concordance:

    An audio concordance is a table in which representations of units which occur in an annotated spoken language corpus are paired with citations from the annotations in which they occur.

For present purposes, our characterisation of annotation is:

    An annotation is a pair consisting of a symbol and a time-stamp, where the symbol represents some linguistic property of a speech signal and the time-stamp represents the temporal location of this property in the speech signal.

Thus, a transcription of a recording, and the recording which it transcribes, implicitly constitute a minimal annotation (t, r), r being a recording and t being a transcription of it, if the start and end of the recording are understood to be aligned with the start and end of the transcription, respectively.

In the general case, the time-stamps can be representations of temporal points (e.g. offsets from the beginning of a speech file) or intervals represented as pairs of points, and the properties can be any arbitrarily complex linguistic unit, from phones through syllables to words or longer units, or linear or hierarchical categories (i.e. generalisations over classes of such units). We refer to the hierarchy of annotation domains as annotation granularity.
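As an illustration of this characterisation, the following Python sketch represents annotations as symbol/time-stamp pairs, with point or interval time-stamps and a tier field for annotation granularity. All class, field and data names are hypothetical; the sketch is not part of the VERBMOBIL conventions or the XML formats used here.

    # A minimal sketch of the characterisations above, assuming a simple
    # in-memory representation; all names are hypothetical.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass(frozen=True)
    class TimeStamp:
        """A temporal point (end is None) or an interval, as offsets
        in seconds from the beginning of a speech file."""
        start: float
        end: Optional[float] = None

    @dataclass(frozen=True)
    class Annotation:
        """A pair of a symbol (some linguistic property of the signal)
        and a time-stamp locating that property in the signal."""
        symbol: str
        stamp: TimeStamp
        tier: str = "word"  # granularity: phone, syllable, word, ...

    # A transcription of a whole recording is a minimal annotation: its
    # interval spans the recording from start to end (dialogue M872B
    # lasts 435 seconds).
    minimal = Annotation(symbol="<transliteration of the whole dialogue>",
                         stamp=TimeStamp(start=0.0, end=435.0),
                         tier="dialogue")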
Broadly speaking, in this characterisation we thus follow the annotation graph approach of Bird & Liberman [5] as applied to speech signals. We characterise annotation graphs as sets of annotations of the same recording; a concrete sketch of this notion is given below.

We now turn to some details in preparation for a discussion of the concordancing of parallel corpora. Let V_i be a set of symbolic representations (e.g. words) in a language L_i, 1 ≤ i ≤ n, and C_i an annotated corpus with elements in V_i, which is an annotation of some
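Continuing the hypothetical classes in the previous sketch, the following treats an annotation graph as a set of annotations of the same recording and adds a simple word-key lookup which pairs a key with time-stamped citations. The file name and word data are invented for illustration; the sketch does not reproduce the XML-based implementation described in this paper.

    # A minimal sketch of concordance lookup over an annotation graph,
    # reusing the hypothetical TimeStamp and Annotation classes from
    # the previous sketch; data and file names are invented.
    from typing import List

    class AnnotationGraph:
        """A set of annotations of the same recording."""
        def __init__(self, recording: str, annotations: List[Annotation]):
            self.recording = recording  # e.g. path to the speech file
            self.annotations = set(annotations)

        def concordance(self, key: str, tier: str = "word") -> List[Annotation]:
            """Pair a word key with the annotations in which it occurs;
            each hit carries a time-stamp referencing the signal."""
            return sorted((a for a in self.annotations
                           if a.tier == tier and a.symbol == key),
                          key=lambda a: a.stamp.start)

    # Usage: every hit yields the interval needed to play the
    # corresponding stretch of audio.
    graph = AnnotationGraph("m872b.wav", [
        Annotation("Termin", TimeStamp(12.3, 12.9)),
        Annotation("Termin", TimeStamp(101.4, 102.0)),
    ])
    for hit in graph.concordance("Termin"):
        print(hit.symbol, hit.stamp.start, hit.stamp.end)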