Retrieval Models for Question and Answer Archives

Xiaobing Xue
Center for Intelligent Information Retrieval, Computer Science Department
University of Massachusetts, Amherst, MA, 01003, USA
xuexb@cs.umass.edu

Jiwoon Jeon
Google, Inc.
Mountain View, CA 94043, USA
jjeon@google.com

W. Bruce Croft
Center for Intelligent Information Retrieval, Computer Science Department
University of Massachusetts, Amherst, MA, 01003, USA
croft@cs.umass.edu

The contributions of this author were done during graduate studies at UMass Amherst.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR’08, July 20–24, 2008, Singapore. Copyright 2008 ACM 978-1-60558-164-4/08/07 ...$5.00.

ABSTRACT

Retrieval in a question and answer archive involves finding good answers for a user’s question. In contrast to typical document retrieval, a retrieval model for this task can exploit question similarity in addition to ranking the associated answers. In this paper, we propose a retrieval model that combines a translation-based language model for the question part with a query likelihood approach for the answer part. The proposed model incorporates word-to-word translation probabilities learned by exploiting different sources of information. Experiments show that the proposed translation-based language model for the question part significantly outperforms baseline methods. Combining it with the query likelihood language model for the answer part yields substantial additional improvements in effectiveness.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Algorithms, Experimentation, Performance

Keywords
Question and Answer Retrieval, Translation Model, Language Model, Information Retrieval

1. INTRODUCTION

Large scale question and answer (Q&A) archives have become an important information resource on the Web. These include the FAQ archives constructed by companies for their products and the archives generated from Web services such as Yahoo! Answers and Live QnA, where people answer questions posed by other people. The retrieval task in a Q&A archive is to find relevant question-answer pairs for new questions posed by the user [6].

Q&A retrieval has several advantages over Web search. First, the user can pose a query in natural language instead of only keywords, and thus can potentially express his/her information need more clearly. Second, the system returns several possible answers directly instead of a long list of ranked documents, and can therefore increase the efficiency of finding the required answers.

Q&A retrieval can also be considered an alternative solution to the general Question Answering (QA) problem. Since the answers for each question in the Q&A archive are generated by humans, the difficult QA task of extracting a correct answer is transformed into the Q&A retrieval task.

The major challenge for Q&A retrieval, as for most information retrieval tasks, is the word mismatch between the user’s question and the question-answer pairs in the archive. For example, “what is francis scott key best known for?” and “who wrote the star spangle banner?” are two very similar questions, but they have no words in common. This problem is more serious for Q&A retrieval, since the question-answer pairs are usually short and there is little chance of finding the same content expressed using different wording.

To solve the word mismatch problem, many different approaches have been proposed.
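The word mismatch in the Francis Scott Key example can be made concrete with a short sketch. The tokenizer, the `toy_translation` table, and the probability values below are illustrative assumptions, not learned quantities or part of the proposed model; the sketch only shows that exact word matching finds nothing in common between the two questions, while a word-to-word translation table could still link them.

```python
import re


def tokens(text):
    """Lowercase a question and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


query = "what is francis scott key best known for?"
archived = "who wrote the star spangle banner?"

q_words = tokens(query)
a_words = tokens(archived)

# Exact matching: the two questions share no words at all.
shared = q_words & a_words

# Hypothetical word-to-word translation probabilities, keyed as
# (query word, archived word). The values are invented for illustration.
toy_translation = {
    ("key", "banner"): 0.2,
    ("francis", "wrote"): 0.1,
}


def linked_pairs(q_words, a_words, table):
    """Return the (query word, archived word) pairs the table can connect."""
    return sorted(
        (q, a) for (q, a) in table if q in q_words and a in a_words
    )


links = linked_pairs(q_words, a_words, toy_translation)
print("shared words:", shared)
print("translation links:", links)
```

With exact matching the overlap is empty, so any purely term-matching score treats the two questions as unrelated; the translation table is what allows a nonzero association between them.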
In this paper, we focus on translation-based approaches, since the relationships between words can be explicitly modeled through word-to-word translation probabilities.

Berger and Lafferty [2] proposed using the classic IBM translation model 1 for information retrieval tasks (IBM model 1 will be described in the following section). However, because of various fundamental differences between machine translation and information retrieval, the pure IBM model performs worse than other state-of-the-art retrieval algorithms. We explain the reasons for the poor performance of the pure IBM model by comparing it with the query likelihood language model. This comparison also gives us insights that enable us to address problems with the IBM model. We propose a mixed model that leverages the benefits of both approaches.

Besides designing the translation-based retrieval model, another important problem is how to learn good word-to-word translation probabilities. In a Q&A archive, since the asker and the answerer may express similar meanings with different words, it is natural to use the question-answer pairs as the “parallel corpus” that is used for estimation in machine