An Optimization Framework for Query Recommendation

Aris Anagnostopoulos 1 aris@cs.brown.edu
Luca Becchetti 1 becchett@dis.uniroma1.it
Carlos Castillo 2 chato@yahoo-inc.com
Aristides Gionis 2 gionis@yahoo-inc.com

1 Sapienza University of Rome, Italy
2 Yahoo! Research Labs, Barcelona, Spain

Partially supported by EU Project N. 215270 FRONTS and by MIUR FIRB project N. RBIN047MH9: “Tecnologia e Scienza per le reti di prossima generazione”.
Submitted for confidential review. August 2009.

ABSTRACT

Query recommendations are an integral part of modern search engines. Their goal is to facilitate users’ search tasks, as well as to help them discover and explore concepts related to their information needs. In this paper, we present a formal treatment of the problem of query recommendation. In our framework we model the user-querying behavior by a probabilistic reformulation graph, or query-flow graph [Boldi et al. CIKM 2008], so that the sequence of queries submitted by a user can be seen as a path on this graph. Assigning score values to queries allows us to define suitable utility functions and to consider the expected utility achieved by performing a random walk on the query-flow graph. Furthermore, providing recommendations can be seen as adding shortcuts in the query-flow graph that “nudge” the reformulation paths of users, in such a way that users are more likely to follow paths with larger expected utility.

We discuss in detail the most important questions that come up in the proposed framework. In particular, we provide examples of meaningful utility functions to optimize, we discuss how to estimate the effect of recommendations on the reformulation probabilities, we address the complexity of the optimization problems we consider, and we suggest efficient algorithmic solutions. We validate our models and algorithms with extensive experimentation.

Categories and Subject Descriptors

H.3.3 [Information Systems]: Information Search and Retrieval

1. INTRODUCTION

A prominent feature of modern search engines is the presence of query recommendations in response to user queries. Query recommendations serve several purposes: correcting possible spelling errors, guiding users through their information-seeking tasks, allowing them to locate information more easily, and helping them explore other concepts related to what they are looking for.

The simplest form of a query recommendation is spell correction, a topic that we do not address in this paper. Instead we focus on more elaborate forms of query recommendations. For instance, by submitting the query “chocolate cookie” a user may be prompted with other queries such as “chocolate cookie recipe” and “chocolate chip cookie recipe”, but also with related queries such as “brownies”, “baking”, and so on.

A key enabling technology for query recommendation is query-log mining, which is used to leverage information about how people use search engines and how they rephrase their queries when they are looking for information. Most of the query-recommendation algorithms proposed in the literature use aggregate user information found in query logs to find potentially successful queries that are relevant to what the user is searching for [2–4, 15, 16]. Current state-of-the-art methods often produce relevant query recommendations, but they typically have no clear objective to optimize, and the query-recommendation algorithms are fairly ad hoc.

In this paper we propose a general and principled methodology for generating query recommendations. We model the query-recommendation problem as a problem of optimizing a global utility function. Our methodology makes the following assumptions, which are also the main ingredients of our approach:

First, we assume that it is possible to aggregate historical information from the query logs to build a query-reformulation graph G [3].
The nodes of the graph are distinct queries, and an edge (q, q') is annotated with the probability that a user will submit query q' after submitting query q. We then model the querying behavior of users as weighted random walks on this graph.

Second, we assume that the queries in the query-flow graph have intrinsic score values w(q) that are increasing in a desired property of the query q, for example, the probability that users who issue q will be satisfied with the search engine results. We assume that during the random walk on the query-flow graph, users collect the scores of (a subset of) the nodes that they visit. The higher the total value collected, the higher the overall utility of the system.
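
To make the two assumptions concrete, the following is a minimal sketch on a toy graph, not the implementation used in this paper. The queries reuse the “chocolate cookie” example, while the transition probabilities, the scores w(q), and the function names are hypothetical values chosen for illustration. The sketch further assumes that the probability mass missing from a node's outgoing edges is the probability that the user ends the session there, and that the user collects the score of every node visited, which is only one possible choice of the subset of nodes.

```python
import random

# Toy query-flow graph: P[q][q2] is the probability that a user submits q2
# right after q. Assumption: the mass missing from a row is the probability
# that the user stops searching after q. All numbers are hypothetical.
P = {
    "chocolate cookie": {"chocolate cookie recipe": 0.5, "brownies": 0.25},
    "chocolate cookie recipe": {"chocolate chip cookie recipe": 0.5},
    "chocolate chip cookie recipe": {},
    "brownies": {},
}

# Hypothetical intrinsic scores w(q), e.g. the probability of user satisfaction.
w = {
    "chocolate cookie": 0.25,
    "chocolate cookie recipe": 0.5,
    "chocolate chip cookie recipe": 1.0,
    "brownies": 0.5,
}


def sample_path(P, start, rng):
    """Sample one reformulation path (a weighted random walk) from start."""
    path = [start]
    while True:
        r = rng.random()
        cumulative = 0.0
        for q2, p in P[path[-1]].items():
            cumulative += p
            if r < cumulative:
                path.append(q2)
                break
        else:
            # r fell into the leftover mass: the user ends the session.
            return path


def expected_utility(P, w, iterations=100):
    """Expected total score collected by a walk started at each query.

    Iterates the fixed-point equation U(q) = w(q) + sum_q2 P[q][q2] * U(q2).
    """
    U = dict(w)
    for _ in range(iterations):
        U = {q: w[q] + sum(p * U[q2] for q2, p in P[q].items()) for q in P}
    return U


U = expected_utility(P, w)
print(U["chocolate cookie"])  # 0.875
print(sample_path(P, "chocolate cookie", random.Random(0)))
```

In this toy setting, a walk started at “chocolate cookie” collects 0.25 at the start node, plus 0.5 × 1.0 through the recipe branch and 0.25 × 0.5 through “brownies”, for an expected utility of 0.875. Changing the outgoing probabilities of a node changes this value, which is the quantity that a utility-driven recommendation strategy would aim to increase.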