A question–answer distance measure to investigate QA system progress

Guillaume Bernard 1, Sophie Rosset 1, Martine Adda-Decker 1 and Olivier Galibert 2
1 LIMSI, Paris, France, 2 LNE, Paris, France
{rosset,madda,gbernard}@limsi.fr, Olivier.Galibert@lne.fr

Abstract

The performance of question answering systems is evaluated through successive evaluation campaigns. A set of questions is given to the participating systems, which are to find the correct answers in a collection of documents. The process used to create the questions may change from one evaluation to the next, which may entail an uncontrolled shift in question difficulty. For the QAst 2009 evaluation campaign, a new procedure was adopted to build the questions. Comparing the results of the QAst 2008 and QAst 2009 evaluations, a strong performance loss was measured in 2009 for French and English, while the Spanish systems globally made progress. The measured loss might be related to this new way of elaborating questions. The general purpose of this paper is to propose a measure to calibrate the difficulty of a question set. In particular, a reasonable measure should output higher values for 2009 than for 2008. The proposed measure relies on the distance between the critical elements of a question and those of the associated correct answer. An increase of the proposed distance measure for the 2009 evaluation as compared to 2008 could be established, and this increase correlates with the previously observed degraded performances. We conclude on the potential of this evaluation criterion: such a measure is important both for elaborating new question corpora for question answering systems and as a tool to control the level of difficulty across successive evaluation campaigns.

1. Introduction

The question-answering (QA) task consists of providing short, relevant answers to natural language questions.
QA research has focused on extracting information from text or spoken sources, providing the shortest relevant text in response to a question. For example, the correct answer to the question Besides France and Germany, where have we seen cases of mad cow-like disease affecting goats? is Belgium [1] rather than a list of documents. This simple example illustrates the two main advantages of QA over current search engines: first, the input is a natural-language question rather than a keyword query; and second, the answer provides the desired information content, and not simply a potentially large set of documents or URLs that the user must plow through. In the QA domain, progress has been observed via evaluation campaigns (Dang et al., 2007; Mitamura et al., 2008; Forner et al., 2008; Turmo et al., 2008). The QAst (Question-Answering on Speech Transcriptions) campaigns focus on evaluating QA systems on speech transcriptions. Spoken sentences have different features from written ones (long sentences, for instance), and the aim is to evaluate the systems on this type of data. Moreover, the systems are evaluated on three different languages: French, English and Spanish. In the QAst 2009 evaluation (Turmo et al., 2009), a new procedure for building the question corpus was proposed. In the previous QAst evaluations (Turmo et al., 2008), the questions were created by the evaluators from the documents. In 2009, the objective was to build more spontaneous questions. Native speakers were requested to read excerpts of documents and to ask, orally, questions about information related to but not included in these excerpts. Because of this new building procedure, the correct answer to a question can potentially be far away from the excerpt used to create the question, especially given the long sentences found in oral transcriptions.

[1] This question is extracted from the QAst 2008 development set, and this is the corresponding answer found in the document collection.
Thus, we aim to evaluate whether this new building procedure had an impact on the results obtained in the QAst 2009 campaign. In this paper, we propose a new measure, based on the distance between the answer to a question and the question's elements, to evaluate whether the difficulty of the task changed as a result. First, we compare the results obtained in the 2008 and 2009 QAst evaluations. We then motivate and describe our measure, which is applied to the question corpora of 2008 and 2009 for each language (French, English and Spanish). We analyze the results and finally conclude on the potential of this measure to assist in building new question corpora for evaluation campaigns.

2. Observations on QAst 2008 and 2009 results

A first observation comes from the general results obtained by all the participants: they all went down (Turmo et al., 2009). There were three similar tasks between the QAst 2008 and 2009 evaluations: question answering on English EPPS data, on Spanish EPPS data and on French broadcast news. In 2009, two question sets were proposed: one with written questions and one with manually transcribed spoken questions. Table 1 shows the results obtained by the 2008 version of our systems and by the 2009 update of the same systems on the test corpus of QAst 2008. The results on each of the tasks improved with the 2009 version. The greater gap for the English and Spanish tasks can be explained in part by the different type of data: the English and Spanish tasks use a corpus built from European Parliament plenary sessions, while the French task uses a broadcast news corpus.
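The measure itself is defined later in the paper; as a rough illustration of the idea stated in the abstract, a distance between a question's critical elements and the associated correct answer in the document, one could compute, for each content word of the question, the token distance to the answer position. The sketch below is our own minimal interpretation, not the authors' implementation: the stopword list, the choice of content words as "critical elements", and all function names are assumptions made for illustration.

```python
# Minimal sketch of a question-answer distance measure (hypothetical
# implementation, not the paper's actual definition). "Critical elements"
# are approximated here by the question's content words, and distance is
# counted in tokens within the supporting document.

STOPWORDS = {"the", "a", "an", "of", "in", "to", "where", "have", "we",
             "and", "is", "are", "besides", "seen"}

def content_words(question):
    """Keep lowercase alphabetic tokens that are not stopwords."""
    return [t.lower() for t in question.split()
            if t.isalpha() and t.lower() not in STOPWORDS]

def qa_distance(question, document_tokens, answer_index):
    """Average token distance between each question element's closest
    occurrence in the document and the answer position. Elements absent
    from the document are ignored; if none occur, the question and
    answer are considered maximally distant."""
    doc = [t.lower() for t in document_tokens]
    distances = []
    for word in content_words(question):
        positions = [i for i, t in enumerate(doc) if t == word]
        if positions:
            distances.append(min(abs(i - answer_index) for i in positions))
    return sum(distances) / len(distances) if distances else float("inf")
```

Under this reading, questions built from excerpts far from the answer (as with the 2009 procedure) would yield larger values on average, which is the behavior the paper's calibration measure is meant to capture.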