Question Answering on Speech Transcriptions: the QAST evaluation in CLEF

L. Lamel¹, S. Rosset¹, C. Ayache², D. Mostefa², J. Turmo³, P. Comas³
¹LIMSI-CNRS, Orsay, France; ²ELDA/ELRA, Paris, France; ³TALP Research Center (UPC), Barcelona, Spain
{lamel,rosset}@limsi.fr, {ayache,mostefa}@elda.org, {turmo,pcomas}@lsi.upc.edu

Abstract

This paper reports on the QAST track of CLEF, which aims to evaluate Question Answering on Speech Transcriptions. Accessing information in spoken documents poses challenges beyond those of text-based QA: systems must address the characteristics of spoken language, as well as recognition errors in the case of automatic transcriptions of spontaneous speech. The framework and results of the pilot QAst evaluation held as part of CLEF 2007 are described, illustrating some of the additional challenges posed by QA on spoken documents relative to written ones. Current plans for future multiple-language and multiple-task QAst evaluations are also presented.

1. Introduction

There are two main paradigms used to search for information: document retrieval and precise information retrieval. In the first approach, documents matching a user query are returned. The match is often based on keywords extracted from the query, and the underlying assumption is that the documents best matching the query provide a data pool in which the user can find the information that suits their need. This need can be very specific (e.g. Who is presiding over the Senate?), or it can be topic-oriented (e.g. I'd like information about the Senate). The user is left to filter through the returned documents to find the desired information, which is quite appropriate for the more general topic-oriented questions, but less well suited to the more specific ones. The second approach, which is better suited to specific queries, is embodied by so-called question answering (QA) systems, which return the most probable answer to a given question (e.g. the answer to Who won the 2005 Tour de France? is Lance Armstrong). In the QA and information retrieval domains, progress has been assessed via evaluation campaigns (Ayache et al., 2006; Kando, 2006; Voorhees and Buckland, 2007; Nunzio et al., 2007; Giampiccolo et al., 2007). In the question-answering evaluations, the systems handle independent questions and should provide one answer to each question, extracted from textual data, in both open and restricted domains. Since much of human interaction is via spoken language (e.g. meetings, seminars, lectures, telephone conversations), it is interesting to explore applying QA to speech data. Accessing information in spoken language requires significant departures from traditional text-based approaches in order to deal with transcripts (manual or automatic) of spontaneous speech. Most QA research carried out by natural language processing groups has developed techniques for written texts, which are assumed to have correct syntactic and semantic structure. Spoken data differs from textual data in various ways: it contains disfluencies, false starts, speaker corrections, and truncated words. The grammatical structure of spontaneous speech is also quite different from that of written discourse. Moreover, spoken data may come from meetings, which exhibit a completely different global structure (for instance, interaction creates run-on utterances in which the distance between the first part of an utterance and the last one can be very long).
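To make this contrast concrete, the sketch below applies a naive cleanup pass to a spontaneous-speech utterance of the kind found in meeting transcripts. The filler inventory and regular expressions are illustrative assumptions made for this example only, not the preprocessing used by any QAst system.

import re

# Illustrative only: a naive normalization pass for spontaneous-speech
# transcripts. The filler list and rules are assumptions for this sketch.
FILLERS = re.compile(r"\b(?:uh|um|er|you know|i mean)\b", re.IGNORECASE)
TRUNCATED = re.compile(r"\b\w+-\s")  # false starts such as "pres- "
REPEATS = re.compile(r"\b(\w+)(?: \1\b)+", re.IGNORECASE)  # "the the" -> "the"

def clean_utterance(utterance: str) -> str:
    """Strip simple disfluencies so that downstream QA matching behaves
    more like it does on written text."""
    text = FILLERS.sub("", utterance)
    text = TRUNCATED.sub("", text)   # drop truncated words first,
    text = REPEATS.sub(r"\1", text)  # then collapse the repeats they expose
    return re.sub(r"\s{2,}", " ", text).strip()

print(clean_utterance("so uh the the pres- the president um presided over it"))
# -> "so the president presided over it"

Real systems must handle such phenomena far more robustly (and, for automatic transcripts, cope with recognition errors as well), but even this toy pass illustrates why QA pipelines tuned to well-formed written text degrade on raw transcripts.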
In 2007, a pilot evaluation campaign, partially sponsored by the FP6 CHIL project, was carried out under the CLEF umbrella for the evaluation of QA systems on speech transcriptions: the QAst evaluation (Turmo et al., 2007). The remainder of this paper is organized as follows. The next section presents the QAst 2007 tasks, followed by a description of the 2007 evaluation in Section 3. Section 4 discusses the results and the plans for the 2008 evaluation. The tasks and evaluation plans for 2008 have been modified in light of the pilot evaluation, in order to allow better comparison between textual and speech question-answering tasks, and to assess question answering on automatic speech transcripts with different error rates (reflecting the quality of the automatic speech recognition systems).

2. The QAst 2007 Tasks

The design of the QAst tasks took two different viewpoints into account. Firstly, automatic transcripts of speech data contain recognition errors which can potentially lead to incorrectly answered or unanswered questions. In order to measure the loss incurred by QA systems due to automatic speech recognition (ASR) technology, a comparative evaluation was introduced for both manual and automatic transcripts. Secondly, dealing with speech from single speakers (monologues) is different from dealing with multi-speaker interactions (dialogues). With the aim of comparing the performance of QA systems on both monologues and dialogues, two scenarios were introduced: lectures and meetings in English, from the CHIL (CHIL, 2004-2007) and AMI (AMI, 2005) projects respectively. From the combination of these two viewpoints, QAst covered the following four tasks:

• T1: Question Answering in manual transcripts of lectures