An Answer Updating Approach to Novelty Detection

Xiaoyan Li
Center for Intelligent Information Retrieval
Department of Computer Science
University of Massachusetts, Amherst MA 01003

W. Bruce Croft
Center for Intelligent Information Retrieval
Department of Computer Science
University of Massachusetts, Amherst MA 01003

ABSTRACT
The detection of new and novel information in a document stream is an important component of many potential applications. This paper describes an answer updating approach to novelty detection at the sentence level. Specifically, we explore the use of question-answering techniques for novelty detection. New information is defined as new (previously unseen) answers to questions representing a user’s information need. A sentence is treated as a novel sentence if the system believes that it may contain a previously unseen answer to the question. Our answer updating approach has two important steps: question formulation and new answer detection. Experiments were carried out on data from the TREC 2003 novelty track using the proposed approach. The results show that the proposed answer updating approach outperforms all three baselines in terms of precision at low recall.

Keywords
Novelty detection, question answering, named entities

1. INTRODUCTION
The goal of research on novelty detection is to provide a user with a list of materials that are relevant and contain new information with respect to the user’s information need. The user should be able to get useful information quickly, without wading through a large amount of redundant information, which is a tedious and time-consuming task. A variety of novelty measures have been described in the literature [6, 7, 23]. These definitions of novelty, however, are quite vague and seem only indirectly related to the intuitive notion of novelty. In most of these measures, new words appearing in an incoming sentence, story, or document contribute to the novelty score, though in different ways.
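To make the word-based notion of novelty concrete, the following is a minimal sketch of one such measure: each sentence is scored by the fraction of its distinct words that have not appeared in any earlier sentence. It is an illustration only, not the specific measure of any cited system; the function names `tokenize` and `new_word_scores`, the simple tokenizer, and the example sentences are all assumptions introduced here.

```python
import re


def tokenize(text):
    """Lowercase the text and keep alphanumeric runs (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def new_word_scores(sentences):
    """Score each sentence by the fraction of its distinct words not seen earlier.

    Hypothetical helper for illustration; real novelty measures weight
    and combine new words in different ways.
    """
    seen = set()    # all words observed in earlier sentences
    scores = []
    for sentence in sentences:
        words = set(tokenize(sentence))
        new_words = words - seen
        # An empty sentence carries no new words, so it scores 0.0.
        scores.append(len(new_words) / len(words) if words else 0.0)
        seen |= words
    return scores


# Illustrative stream of sentences, in arrival order (made-up data).
stream = [
    "The storm hit the coast on Monday.",
    "The storm hit the coast.",
    "Officials reported three injuries on Tuesday.",
]
scores = new_word_scores(stream)
```

A measure of this kind gives a high score to any sentence that introduces unfamiliar vocabulary, whether or not those words tell the user anything they wanted to know, which is one way of seeing why word-based scores are only indirectly related to intuitive novelty.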
We define novelty as new answers to the potential questions that represent a user’s request or information need. If a new answer to a question that represents the user’s information need, or part of it, appears in a sentence, story, or document, then we say that the sentence (or story or document) contains new information that the user wants. Given this definition of novelty, it is possible to detect new information by monitoring how the answer to a question changes. We therefore propose to perform novelty detection via answer updating. This approach is made more feasible by ongoing progress in research on question answering techniques [14].

The rest of the paper is organized as follows. Section 2 gives a short overview of related work on novelty detection. Section 3 introduces our new definition of novelty and presents a new perspective on understanding novelty, supported by an analysis of the TREC novelty track data. Section 4 describes the proposed answer updating approach and explains how novelty detection can be done via answer updating. The experimental design and results are presented in Section 5. Section 6 briefly discusses the challenges in building data collections for testing novelty detection approaches. Section 7 concludes the paper and outlines future work.

2. RELATED WORK
Novelty detection has been studied at three different levels: the event level, the sentence level, and the document level. Work on novelty detection at the event level arises from Topic Detection and Tracking (TDT) research, which is concerned with online new event detection, also called first story detection [1, 2, 3, 4, 5, 16, 18]. Current techniques for new event detection are usually based on clustering algorithms. Some model (vector space model, language model, lexical chain, etc.) is used to represent each incoming news story or document. Each story is then grouped into clusters.
An incoming story is either added to its closest cluster, if the similarity score between them is above a preset similarity threshold, or starts a new cluster. A story that starts a new cluster is marked as the first story about a new topic; if a separate novelty threshold is used, the story is instead marked as “old” (about an old event) when the similarity score between the story and its closest cluster is greater than that novelty threshold.

Research on novelty detection at the sentence level is related to the TREC novelty track, whose task is to find relevant and novel sentences given a topic and an ordered list of relevant documents [7, 8, 9, 10, 11, 12, 13, 23]. Novelty detection can also be performed at the document level, for example, in Zhang et al.’s work [13] on novelty and redundancy detection in adaptive filtering, and in Zhai et al.’s work [17] on subtopic retrieval.

In current techniques for novelty detection at the sentence or document level, new words appearing in sentences or documents usually contribute to the scores that are used to rank them. Many similarity functions used in information retrieval (IR) have also been tried in novelty detection. Usually a high similarity score between a sentence and a given query will increase the relevance rank of the sentence, while a high similarity score between the sentence and all previously seen sentences will decrease the novelty rank of the sentence, for example, the Maximal Marginal