Temporal Dynamics of User Interests in Web Search Queries
Aysegul Cayci, Selcuk Sumengen,
Cagatay Turkay
Sabanci University
{aysegulcayci,selcuk,turkay}@su.sabanciuniv.edu
Selim Balcisoy,
Yucel Saygin
Sabanci University
{balcisoy, ysaygin}@sabanciuniv.edu
Abstract
Web search query logs contain valuable
information which can be utilized for personalization
and improvement of search engine performance. The
aim of this paper¹ is to cluster users based on their
interests, and analyze the temporal dynamics of these
clusters. In the proposed approach, we first apply
clustering techniques to group similar users with
respect to their web searches. Anticipating that the
small number of query terms used in search queries
would not be sufficient to obtain a proper clustering
scheme, we extracted the summary content of the
clicked web page from the query log. In this way, we
enriched the feature set more efficiently than by
crawling page content. We also provide preliminary survey
results to evaluate the clusters. Clusters may change as
users move from one cluster to another over time,
because users’ interests may shift. We used statistical methods for the
analysis of temporal changes in users’ interests. As a
case study, we experimented on the query logs of a
search engine.
1. Introduction
Query logs of search engines are an indispensable
source of information to understand web search
behavior. Query logs are analyzed in particular for
search result re-ranking, search result clustering, query
suggestion, and query modification. In this work, we cluster
users based on their queries and clicks in the query log
made available by the AOL² search engine. We first
cluster users with respect to their query terms and then
analyze temporal variations among clusters. We use
three statistical measures proposed in [4], namely overall
overlap, distinct overlap, and Pearson correlation, to
assess the temporal changes in user interests.

¹ This work was partially funded by The Scientific and
Technological Research Council of Turkey (TUBITAK) under grant
number 105E185.
² Note that AOL later withdrew this data from its website due
to privacy concerns. In this paper, we do not intend to breach the
privacy of individuals. The research results in this paper are only
aggregate information, which does not leak personal information.

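The three measures named in this section (overall overlap, distinct overlap, and Pearson correlation) can be illustrated with a minimal Python sketch. The definitions below are one plausible reading, not the exact formulas of [4]: distinct overlap is taken as the share of distinct items seen in both periods, overall overlap as the share of total occurrences belonging to shared items, and Pearson correlation is computed over the frequencies of the shared items. The function names and the toy data are hypothetical.

```python
from collections import Counter
from math import sqrt

def distinct_overlap(a: Counter, b: Counter) -> float:
    """Assumed definition: distinct items in both periods / distinct items in either."""
    union = set(a) | set(b)
    if not union:
        return 0.0
    return len(set(a) & set(b)) / len(union)

def overall_overlap(a: Counter, b: Counter) -> float:
    """Assumed definition: occurrences of shared items / all occurrences."""
    shared = set(a) & set(b)
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    return sum(a[k] + b[k] for k in shared) / total

def pearson(a: Counter, b: Counter) -> float:
    """Pearson correlation of the frequencies of items present in both periods."""
    shared = sorted(set(a) & set(b))
    n = len(shared)
    if n < 2:
        return 0.0  # correlation is undefined for fewer than two points
    xs = [a[k] for k in shared]
    ys = [b[k] for k in shared]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # constant frequencies: correlation is undefined
    return cov / sqrt(vx * vy)

# Hypothetical interest counts for two time slices of a day.
morning = Counter({"music": 4, "travel": 2, "cars": 1})
evening = Counter({"music": 2, "travel": 1, "news": 3})
print(distinct_overlap(morning, evening))
print(overall_overlap(morning, evening))
print(pearson(morning, evening))
```

Counting occurrences and counting distinct items answer different questions: a few very popular items can give a high overall overlap even when most distinct items differ between the two periods.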
We use the document clustering technique presented in
[10], extended with feature enrichment. Existing
query clustering methods proposed in the literature
must deal with the problem of short queries: query
clustering is similar to document clustering, but the
small number of terms used in the queries prevents the
generation of meaningful clusters. One method to
handle this problem is to consider the URL that the
user has clicked as implicit feedback, and use the
content of the page addressed by the URL to enrich the
feature set of clustered queries [2][8]. Another method
is to incorporate the number of identical URLs that users
have clicked into the similarity measure [3][5][9].
Wen et al. [9] compared the precision and recall of
query clusters produced by different similarity
measures. They report that clusters produced by the
similarity of keywords together with the number of shared
clicks have the highest precision and recall. Beeferman
and Berger [3] proposed a content-ignorant query
clustering method where similarity by keyword is not
considered but queries with similar clicks are clustered.
The query clustering algorithm of Chan et al. [5] improves on
Beeferman and Berger’s by eliminating noisy user
clicks. Clusters can also be improved by taking
advantage of the fact that contents of clicked URLs
provide more keywords, by including the keywords
either from the top ranked result snippets [8] or from
the top ranked result pages [2]. Fonseca et al. [6] use
association rule mining to find related queries. Ross
and Wolfram [7] categorized term pairs in the queries
and used query clustering to discover common terms of
categories.
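The two ideas surveyed above, enriching short queries with text from the clicked page and counting shared clicks in the similarity measure, can be sketched together in a few lines of Python. This is an illustration under stated assumptions, not the method of any cited work: whitespace tokenization, Jaccard similarity, the linear weight `alpha`, and all function names and sample data are hypothetical choices made for the example.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def enriched_terms(query: str, clicked_summary: str = "") -> set:
    """Query terms plus the terms of the clicked page's summary (if any)."""
    return set(query.lower().split()) | set(clicked_summary.lower().split())

def combined_similarity(terms_a: set, urls_a: set,
                        terms_b: set, urls_b: set,
                        alpha: float = 0.5) -> float:
    """Assumed linear mix of keyword similarity and shared-click similarity."""
    return alpha * jaccard(terms_a, terms_b) + (1 - alpha) * jaccard(urls_a, urls_b)

# Hypothetical log entries: query text, clicked-page summary, clicked URLs.
terms_1 = enriched_terms("jaguar price", "Jaguar cars dealer price list")
terms_2 = enriched_terms("jaguar dealer")
urls_1 = {"jaguar.com"}
urls_2 = {"jaguar.com", "cars.com"}

print(jaccard(set("jaguar price".split()), terms_2))   # query terms only
print(jaccard(terms_1, terms_2))                        # after enrichment
print(combined_similarity(terms_1, urls_1, terms_2, urls_2))
```

In this toy case the raw queries share one of three distinct terms, while the enriched term set also picks up "dealer" from the page summary, which is the kind of extra signal that feature enrichment is meant to supply.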
Users who submit similar queries share
common interests and constitute a group of users. It
has been shown by Beitzel et al. [4] that the category
popularity of queries changes over the hours of a day.
We investigate the flow between user groups over six
time slices of a day. There exist few temporal query
2009 International Conference on Advanced Information Networking and Applications Workshops
978-0-7695-3639-2/09 $25.00 © 2009 IEEE
DOI 10.1109/WAINA.2009.71