Supporting Factual Statements with Evidence from the Web

Chee Wee Leong
Computer Science and Engineering
University of North Texas
Denton, TX, 76207
cheeweeleong@my.unt.edu

Silviu Cucerzan
Microsoft Research
Microsoft Way
Redmond, WA, 98052
silviu@microsoft.com

ABSTRACT

Fact verification has become an important task due to the increased popularity of blogs, discussion groups, and social sites, as well as of encyclopedic collections that aggregate content from many contributors. We investigate the task of automatically retrieving supporting evidence from the Web for factual statements. Using Wikipedia as a starting point, we derive a large corpus of statements paired with supporting Web documents, which we employ further as training and test data under the assumption that the contributed references to Wikipedia represent some of the most relevant Web documents for supporting the corresponding statements. Given a factual statement, the proposed system first transforms it into a set of semantic terms by using machine learning techniques. It then employs a quasi-random strategy for selecting subsets of the semantic terms according to topical likelihood. These semantic terms are used to construct queries for retrieving Web documents via a Web search API. Finally, the retrieved documents are aggregated and re-ranked by employing additional measures of their suitability to support the factual statement. To gauge the quality of the retrieved evidence, we conduct a user study through Amazon Mechanical Turk, which shows that our system is capable of retrieving supporting Web documents comparable to those chosen by Wikipedia contributors.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Algorithms, Experimentation, Measurement

Keywords
Fact verification, supporting evidence, Wikipedia, Web search, Web references, semantic term extraction.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CIKM’12, October 29–November 2, 2012, Maui, HI, USA. Copyright 2012 ACM 978-1-4503-1156-4/12/10 ...$15.00.

1. INTRODUCTION

Web search engines have become the de facto standard for retrieving relevant data for informational needs that can be expressed in short queries, such as “Barack Obama major” or “Obama biography”. However, enabling search engines to provide evidence for a complex factual statement, such as “In 1981, Obama transferred to Columbia University in New York City, where he majored in political science with a specialty in international relations” is largely an unexplored research problem. Retrieving evidence for such statements typically requires that users formulate short queries that contain words likely to occur all together on relevant Web pages. Furthermore, the snippets returned by Web search engines focus on the words of those short queries, making it hard for a user to determine if the retrieved Web pages support the factual statement without further navigation to the actual Web page content.

Is what one reads also what one believes to be true? The ability to verify factual information quickly is crucial in many business scenarios (e.g., politics, media, stock market, retail), as well as for the day-to-day needs of individual users. Fact verification has become particularly important due to the increased popularity of blogs, discussion groups, and social sites, as well as of encyclopedic collections that aggregate content provided by many contributors, such as Wikipedia and IMDB.
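The query-formulation difficulty described above, turning one long factual statement into several short queries whose words are likely to co-occur on relevant pages, can be illustrated with a minimal sketch. It is not the system proposed in this paper: the term weights are hand-assigned placeholders rather than learned values, the function names `sample_term_subsets` and `to_query` are hypothetical, and plain weighted sampling without replacement stands in for the paper's quasi-random selection strategy.

```python
import random


def sample_term_subsets(term_weights, query_len=3, n_queries=4, seed=0):
    """Draw distinct term subsets by weighted sampling without replacement.

    term_weights maps a term to a positive, hypothetical relevance weight.
    Returns a list of tuples, each holding the terms of one candidate query.
    """
    rng = random.Random(seed)
    subsets, seen = [], set()
    attempts = 0
    # Cap the attempts so the loop ends even if few distinct subsets exist.
    while len(subsets) < n_queries and attempts < 100 * n_queries:
        attempts += 1
        pool = dict(term_weights)
        chosen = []
        for _ in range(min(query_len, len(pool))):
            candidates = list(pool)
            weights = [pool[t] for t in candidates]
            pick = rng.choices(candidates, weights=weights, k=1)[0]
            chosen.append(pick)
            del pool[pick]  # sample without replacement
        key = frozenset(chosen)
        if key not in seen:
            seen.add(key)
            subsets.append(tuple(chosen))
    return subsets


def to_query(subset):
    """Join terms into a query string, quoting multi-word terms."""
    return " ".join(f'"{t}"' if " " in t else t for t in subset)


# Terms taken from the example statement; the weights are illustrative only.
terms = {
    "Obama": 1.0,
    "Columbia University": 0.9,
    "political science": 0.8,
    "international relations": 0.7,
    "1981": 0.6,
    "New York City": 0.4,
}

for subset in sample_term_subsets(terms):
    print(to_query(subset))
```

Each printed line is a short query that could be sent to a search engine. Fixing the seed makes the sampled subsets reproducible, which is convenient when comparing retrieval results across runs.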
While such collections generally provide accurate information, each factual statement may require additional verification or support due to the nature of the open contribution process. Currently, Wikipedia requires contributors to provide reliable sources for the edited content whenever possible, particularly for factual statements of a controversial nature. This requirement creates both new opportunities for annotated data and a need for data annotation tools. On the one hand, it has resulted in numerous Web page citations being added to Wikipedia, which provides research opportunities for investigating a large collection of factual data annotated with references to supporting Web evidence. On the other hand, due to the size of the task, the majority of facts stated in Wikipedia still lack proper references; thus, building a tool that helps contributors and editors easily retrieve and/or improve such references can have a positive impact on Wikipedia and the Web community.

Our current objective is to investigate the retrieval of Web evidence to support any general factual statement. Since our investigation starts from the Wikipedia collection, this also provides a concrete method that can be employed by Wikipedia contributors to create relevant citations to Web sources. While assertions made in any particular statement may be questionable, we do not address here the task of determining their validity. The focus of the paper is the retrieval of the best supporting Web evidence for a factual statement as given. Future work will address models for also retrieving contradicting evidence, to provide counterclaims to the assertions of input factual statements.

We use Wikipedia as a starting point to derive a large collection of factual statements with supporting Web evidence, which we em-