To appear in the Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (WI 2005), University of Technology of Compiegne, France, 19-22 September 2005.

Evaluation of Response Quality for Heterogeneous Question Answering Systems

Wilson Wong & Shahrin Sahib
Faculty of Information and Communication Technology, Kolej Universiti Teknikal Kebangsaan Malaysia, Melaka, Malaysia
{wilson, shahrinsahib}@kutkm.edu.my

Ong-Sing Goh
School of Information Technology, Murdoch University, Perth, Western Australia, 6150
osgoh88@gmail.com

Abstract

The research in this paper makes explicit why existing measures for response quality evaluation are not suitable for the ever-evolving field of question answering, and following that, a short-term solution for evaluating the response quality of heterogeneous systems is put forward. To demonstrate the challenges in evaluating systems of different natures, this research presents a black-box approach using a classification scheme and scoring mechanism to assess and rank three example systems.

1. Introduction

Generally, question answering systems can be categorized into two groups based on the approach in each dimension. The first is question answering based on shallow natural language processing and information retrieval, and the second is question answering based on natural language understanding and reasoning. Table I summarizes the characteristics of the two approaches with respect to the dimensions in question answering. Well-known systems from the first approach include Webclopedia [1] and AnswerBus [2], while examples of question answering systems from the second approach include WEBCOOP [3] in tourism, NaLURI [4] in Cyberlaw and START [5].

Table I.
Characteristics of the two approaches in question answering

Dimension   | Shallow natural language processing and information retrieval | Natural language understanding and reasoning
Technique   | Syntax processing and information retrieval                    | Semantic analysis or higher, and reasoning
Source      | Free-text documents                                            | Knowledge base
Domain      | Open-domain                                                    | Domain-oriented
Response    | Extracted snippets                                             | Synthesized responses
Question    | Questions using wh-words                                       | Beyond wh-words
Evaluation  | Information retrieval metrics                                  | N/A

The evaluation of question answering systems for non-dynamic responses has been largely reliant on the TREC corpus. It is easy to evaluate systems for which there is a clearly defined answer; however, for most natural language questions there is no single correct answer [6]. Evaluation can become a very subjective matter, especially when dealing with different types of natural language systems in different domains, for several reasons: there are no baseline or comparable systems in certain domains, developing test questions is not easy, and, owing to the dynamic nature of the responses, there is no right or wrong answer, as there are always responses that justify the absence of an answer.

2. Existing Metrics for Question Answering

The most notable evaluation for question answering has to be the question answering track in the TREC evaluation [7]. Evaluation in TREC assesses the quality of responses in terms of precision and recall, and is well-suited for question answering systems based on shallow natural language processing and information retrieval, such as AnswerBus.
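The precision and recall used in such information-retrieval-style evaluations can be sketched as follows. This is a minimal illustration of the general metrics only, not TREC's official scoring procedure; the answer judgments below are hypothetical.

```python
# Minimal sketch of precision/recall scoring for extracted answer
# snippets judged against a set of known correct answers.
# The judgment data below is hypothetical, for illustration only.

def precision_recall(returned, relevant):
    """Precision: fraction of returned answers judged correct.
    Recall: fraction of known correct answers that were returned."""
    returned, relevant = set(returned), set(relevant)
    correct = returned & relevant
    precision = len(correct) / len(returned) if returned else 0.0
    recall = len(correct) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judgments for one test question.
system_answers = ["1969", "1970", "July 1969"]
gold_answers = ["1969", "July 1969"]

p, r = precision_recall(system_answers, gold_answers)
print(p, r)  # 2 of the 3 returned answers are correct; both gold answers found
```

Note that such scoring presupposes a fixed set of judged-correct answers per question, which is precisely the assumption that breaks down for the synthesized, dynamic responses discussed below.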
There are several inherent requirements that make such evaluation inappropriate for domain-oriented question answering systems based on understanding and reasoning: assessments must average over a large corpus or query collection; assessments have to be binary, with answers classified only as correct or incorrect; and assessments would be heavily skewed by the corpus, making the results not translatable from one domain to another. There are also other measures, but they are mostly designed for general tasks related to natural language processing such as translation, database query, etc. [8] proposes that a simple number scale be established for the evaluation of natural language text processing systems. This metric is to be based on the simple average of four things: size of the