Predicting the News of Tomorrow Using Patterns in Web Search Queries

Kira Radinsky, Sagie Davidovich and Shaul Markovitch
Computer Science Department
Technion–Israel Institute of Technology
{kirar, sagied, shaulm}@cs.technion.ac.il

Abstract

The novel task we address in this work is predicting the top terms that will appear prominently in future news. To the best of our knowledge, this difficult task has not been attempted before. We present a novel methodology for using patterns of user queries to predict future events. Query history is obtained from web resources such as Google Trends. To predict whether a term will appear in tomorrow's news, we examine whether the terms in today's queries indicated this term in the past. We provide empirical support for the effectiveness of our method by showing its predictive power on news archives.

1 Introduction

Many organizations invest significant effort in trying to predict events that are likely to take place in the near future. Such predictions can be beneficial for various purposes, such as planning, resource allocation, and identification of risks and opportunities. Predicting global events in politics, economics, society, etc. is a difficult task that is usually performed by human experts possessing extensive domain-specific and common-sense knowledge. Is it possible to design an algorithm that will automate this process?

Many events are hard to predict because their occurrence is spread over a long time period, in a very tangled net of relations and mutual influence. However, some of them share a common pattern. Events that indicated other events in the past might do so again, as history repeats itself. Some events have preliminary signs. Identifying these signs may enable us to predict the events themselves.

The current state of the art in NLP does not allow us to perform deep analysis of news to identify events.
However, a global event usually draws the attention of web users, triggering many of them to submit queries related to that event. We assume that the popularity of terms appearing in such queries peaks when the event occurs. These spikes can be used as supporting evidence for the occurrence of the event.

In this paper we present a novel method, PROFET, that mines large-scale web resources to predict terms that are likely to appear in the news of the near future. Specifically, we predict 100 terms that will prominently appear in the news up to one week from now. The main resource used by our method is a search-query history archive (in this work, Google Trends). To predict whether an event will appear in tomorrow's news, we examine whether the terms representing today's events (extracted from today's queries) indicated this event in the past. This is done by analyzing patterns in user queries for these terms.

We test our algorithm by examining whether the terms it predicts indeed appear in the news. We compare its performance to a baseline method, which assumes that the news of today will be the news of tomorrow, and find our algorithm to be significantly better, especially for longer prediction periods.

The main contributions of this paper are threefold. First, we introduce a new method for predicting global future events using their patterns in the past. Second, we present a novel use of an aggregated collection of search queries. Finally, we introduce a testing methodology for evaluating such news prediction algorithms.

2 The Prediction Algorithm

In this work we obtain the history of user queries from two main sources:

1. Google Trends is a service that provides charts representing the popularity of given search terms over time.
2. Google Hot Trends is a service that presents the 100 top searched queries on a certain day that deviate the most from their historic search pattern.
The service also provides related searches for each of the top terms.

2.1 Formal framework

Let $W = \{w_1, w_2, \ldots, w_k\}$ be a set of terms characterizing events. Let $D = \langle d_1, \ldots, d_n \rangle$ be an ordered set of days.
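
As a concrete reading of this notation, the following minimal Python sketch represents $W$ as a set of terms and $D$ as an ordered list of dates, and uses them to express the baseline described in the introduction (today's news terms are predicted to be tomorrow's). This is an illustration only, not the PROFET implementation: the sample terms, the dates, the `news_terms` table, and the use of precision as the score are all assumptions introduced here.

```python
from datetime import date, timedelta

# W: a set of terms characterizing events (hypothetical sample values).
W = {"election", "earthquake", "olympics", "oil"}

# D: an ordered sequence of days d_1, ..., d_n (hypothetical dates).
start = date(2008, 1, 1)
D = [start + timedelta(days=i) for i in range(4)]

# Hypothetical toy data: the terms from W that appeared in the news each day.
news_terms = {
    D[0]: {"election", "oil"},
    D[1]: {"election", "earthquake"},
    D[2]: {"earthquake", "olympics"},
    D[3]: {"olympics", "oil"},
}


def baseline_predict(day_index):
    """Persistence baseline: predict that today's news terms recur tomorrow."""
    return set(news_terms[D[day_index]])


def precision(predicted, actual):
    """Fraction of predicted terms that actually appeared (assumed metric)."""
    if not predicted:
        return 0.0
    return len(predicted & actual) / len(predicted)


# Score the baseline on every day that has a following day in D.
scores = [
    precision(baseline_predict(i), news_terms[D[i + 1]])
    for i in range(len(D) - 1)
]
```

Any predictor that returns a set of terms for a given day index could be scored in place of `baseline_predict`, which makes the baseline a convenient point of comparison.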