A Naïve Bag-of-Words Approach to Wikipedia QA

Davide Buscaldi and Paolo Rosso
Dpto. de Sistemas Informáticos y Computación (DSIC),
Universidad Politécnica de Valencia, Spain
{dbuscaldi, prosso}@dsic.upv.es

August 18, 2006

Abstract

This paper presents a simple approach to the Wikipedia Question Answering pilot task in CLEF 2006. The approach ranks the snippets, retrieved using the Lucene search engine, by means of a similarity measure based on bags of words extracted from both the snippets and the articles in Wikipedia. We participated in the monolingual English and Spanish tasks.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software

General Terms

Measurement, Algorithms, Experimentation

Keywords

Question Answering, Wikipedia

1 Introduction

Question Answering (QA) based on Wikipedia (WiQA) is a novel task, proposed as a pilot task in CLEF 2006. Wikipedia has recently caught the attention of various researchers [5, 1] as a resource for the QA task, in particular for the direct extraction of answers. WiQA is quite a different task, since it is aimed at helping the readers and authors of Wikipedia rather than at finding answers to user questions. In the words of the organizers¹, the purpose of the WiQA pilot is “to see how IR and NLP techniques can be effectively used to help readers and authors of Wikipages get access to information spread throughout Wikipedia rather than stored locally on the pages”. An author of a given Wikipage may be interested in collecting information about the topic of the page that is not yet included in the text but is relevant and important for the topic, so that it can be used to update the content of the Wikipage.
Therefore, an automatic system will provide the author with information snippets extracted from Wikipedia with the following characteristics:

unseen: not already included in the given source page;

¹ http://ilps.science.uva.nl/WiQA/Task/index.html