A Semantic Question/Answering System using Topic Models

Asli Celikyilmaz
Computer Science Division
University of California, Berkeley
asli@eecs.berkeley.edu

Abstract

Bayesian topic models have been used in several fields of natural language processing to help extract information from unstructured text. In particular, previous research on topic-model-based retrieval methods has shown significant performance improvements. This paper deals with a more complex information-extraction task, namely Question/Answering (QA). For any given question posed in natural language, QA systems are designed to extract a possible answer as a semantic group or a pre-defined named-entity type, e.g., person, organization, city, etc. Relating query terms to existing entities in a given corpus is therefore crucial in QA systems. Our goal is to improve the performance of our QA system by utilizing information from natural groupings of words in documents, i.e., topics, in relation to named-entity types in their vicinity. Our empirical analysis indicates that the proposed topic model extracts more accurate snippets (paragraphs) containing a true answer string for a given question than keyword search or previous topic-model approaches for information retrieval.

1 Introduction and Motivation

Question/Answering (QA) is a line of research in natural language processing in which a user poses a question in natural language, e.g., "Who is the winner of the Nobel Peace Prize in 2009?" (our running example), and expects an answer as a word, phrase, or sentence. QA research thus attempts to deal with a wide range of question types, including fact, list, definition, semantically-constrained, and cross-lingual questions. In this research,^1 we deal with simple to complex factoid questions, where the expected answer is a word or a phrase (e.g., Barack Obama, of the 'HUMAN' named-entity type).
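To make the factoid setting concrete, the sketch below maps a question to an expected answer type. The paper's system uses a trained question classifier [3]; this toy version just keys on the question word, and the cue-to-label table (and the fine labels other than HUMAN:Individual) is purely illustrative.

```python
# Toy answer-type identification for factoid questions. The real system
# uses a trained question classifier; this lookup table is a hypothetical
# stand-in, keyed on the question's leading wh-word.
ANSWER_TYPES = {
    "who": "HUMAN:Individual",
    "where": "LOCATION:City",    # fine labels here are illustrative
    "when": "NUMERIC:Date",
    "how many": "NUMERIC:Count",
}

def expected_answer_type(question):
    q = question.lower()
    # Check longer cues first so "how many" is matched before "how".
    for cue in sorted(ANSWER_TYPES, key=len, reverse=True):
        if q.startswith(cue):
            return ANSWER_TYPES[cue]
    return "UNKNOWN"

print(expected_answer_type("Who is the winner of the Nobel Peace Prize in 2009?"))
# → HUMAN:Individual
```

For the running example, the classifier's output constrains the downstream modules to prefer snippets that mention a person.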
Before we present our new cluster-based QA model, we briefly explain a typical QA process (pipeline). Initially, a given question is broken down into keywords and semantic groups, e.g., subject, object, etc., and its answer type is identified via a question-classifier module [3]. An answer type is typically a pre-defined named entity such as country, number, or food type (examples in Table 1). For example, our named-entity recognizer (NER) module [3], trained using conditional random fields [5], can identify up to 6 coarse and 50 fine named-entity types. Thus, the answer type (named entity) of the running question would be the 'HUMAN' coarse entity followed by a finer sub-group, i.e., HUMAN:Individual. Next, a document-retrieval module uses a search engine to extract documents/paragraphs/sentences, namely snippets, from the entire document set (corpus) that are likely to contain the answer being sought. Usually, a classifier model trained on retrieved text snippets is used to estimate the posterior probability of an answer being contained in each text snippet, P(answer|snippet_i). Our research [3] indicates that the answer-type named entity is one of the core features for ranking snippets higher when they have a high likelihood of containing the right answer.

^1 Supported in part by ONR N00014-02-1-0294, BT Grant CT1080028046, Omron Grant, Tekes Grant, Chevron Texaco Grant, The Ministry of Communications and Information Tech. of Azerbaijan and BISC Program of UC Berkeley.
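The snippet-ranking step of the pipeline above can be sketched as follows. This is not the paper's trained classifier: the two features (keyword overlap and presence of the expected answer-type entity), the feature weights, and the crude capitalized-name pattern standing in for the NER module are all hypothetical, chosen only to illustrate scoring snippets by a pseudo-posterior P(answer|snippet_i).

```python
# Hypothetical sketch of snippet ranking: score each retrieved snippet by a
# toy posterior built from keyword overlap with the question and presence of
# the expected answer-type entity (approximated here by a regex, not a real NER).
import math
import re

def score_snippet(question_keywords, snippet, answer_type_pattern):
    tokens = set(re.findall(r"\w+", snippet.lower()))
    overlap = len(question_keywords & tokens) / max(len(question_keywords), 1)
    has_entity = 1.0 if re.search(answer_type_pattern, snippet) else 0.0
    # Logistic link squashes the weighted feature sum into a pseudo-probability;
    # the weights are made up for illustration.
    return 1.0 / (1.0 + math.exp(-(2.0 * overlap + 3.0 * has_entity - 2.5)))

def rank_snippets(question, snippets, answer_type_pattern):
    keywords = set(re.findall(r"\w+", question.lower()))
    scored = [(score_snippet(keywords, s, answer_type_pattern), s) for s in snippets]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

question = "Who is the winner of the Nobel Peace Prize in 2009?"
snippets = [
    "The 2009 Nobel Peace Prize was awarded to Barack Obama.",
    "The Nobel committee meets annually in Oslo.",
]
# HUMAN:Individual approximated by a capitalized two-word name pattern.
ranking = rank_snippets(question, snippets, r"[A-Z][a-z]+ [A-Z][a-z]+")
```

Under these toy weights, the snippet containing the true answer string is ranked first because it shares more question keywords while also matching the answer-type pattern.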