Temporal Dynamics of User Interests in Web Search Queries

Aysegul Cayci, Selcuk Sumengen, Cagatay Turkay
Sabanci University
{aysegulcayci,selcuk,turkay}@su.sabanciuniv.edu

Selim Balcisoy, Yucel Saygin
Sabanci University
{balcisoy, ysaygin}@sabanciuniv.edu

Abstract

Web search query logs contain valuable information that can be used for personalization and for improving search engine performance. The aim of this paper¹ is to cluster users based on their interests and to analyze the temporal dynamics of these clusters. In the proposed approach, we first apply clustering techniques to group similar users with respect to their web searches. Anticipating that the small number of terms in search queries would not be sufficient to obtain a proper clustering scheme, we extracted the summary content of the clicked web page from the query log. In this way, we enriched the feature set more efficiently than by crawling page content. We also provide preliminary survey results to evaluate the clusters. Clusters may change as users flow from one cluster to another over time, because users’ interests may shift. We used statistical methods to analyze temporal changes in users’ interests. As a case study, we experimented on the query logs of a search engine.

1. Introduction

Query logs of search engines are an indispensable source of information for understanding web search behavior. Query logs are analyzed in particular for search result re-ranking, search result clustering, and query suggestion or modification. In this work, we cluster users based on their queries and clicks in the query log made available by the AOL² search engine. We first cluster users with respect to their query terms and then analyze temporal variations among clusters. We use three statistical measures proposed in [4] (overall overlap, distinct overlap, and Pearson correlation) to assess the temporal changes in user interests. We use the document clustering technique presented in [10], incorporating feature enrichment.

¹ This work was partially funded by The Scientific and Technological Research Council of Turkey (TUBITAK) under grant number 105E185.
² Note that AOL later withdrew this data from its website due to privacy concerns. In this paper, we do not intend to breach the privacy of individuals. The research results in this paper are only aggregate information, which does not leak personal information.

Existing query clustering methods proposed in the literature deal with the problem of short queries: query clustering is similar to document clustering, but the small number of terms used in queries prevents the generation of meaningful clusters. One method to handle this problem is to treat the URL that the user clicked as implicit feedback and to use the content of the page at that URL to enrich the feature set of the clustered queries [2][8]. Another method is to incorporate the number of identical URLs that users have clicked into the similarity measure [3][5][9]. Wen et al. [9] compared the precision and recall of query clusters produced by different similarity measures. They report that clusters produced by keyword similarity combined with the number of shared clicks have the highest precision and recall. Beeferman and Berger [3] proposed a content-ignorant query clustering method in which keyword similarity is not considered and queries with similar clicks are clustered. Chan et al.’s [5] query clustering algorithm improves on Beeferman and Berger’s by eliminating noisy user clicks. Clusters can also be improved by taking advantage of the additional keywords that the contents of clicked URLs provide, including keywords either from the top-ranked result snippets [8] or from the top-ranked result pages [2]. Fonseca et al. [6] use association rule mining to find related queries.
Ross and Wolfram [7] categorized term pairs in queries and used query clustering to discover common terms of categories.

A cluster of users submitting similar queries shares common interests and constitutes a group of users. Beitzel et al. [4] showed that the category popularity of queries changes over the hours of a day. We investigate the flow between user groups over six time slices of a day. There exist few temporal query

2009 International Conference on Advanced Information Networking and Applications Workshops
978-0-7695-3639-2/09 $25.00 © 2009 IEEE
DOI 10.1109/WAINA.2009.71