An analysis of web proxy logs with query distribution pattern approach for
search engines
Mona Taghavi a, Ahmed Patel b,c,⁎, Nikita Schmidt b, Christopher Wills c, Yiqi Tew b
a Department of Computer, Science and Research Branch, Islamic Azad University, Tehran, Iran
b Department of Computer Science, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia (The National University of Malaysia), 43600 Bangi, Selangor Darul Ehsan, Malaysia
c Centre for Applied Research in Information Systems, Faculty of Computing Information Systems and Mathematics, Kingston University, Penrhyn Road, Kingston upon Thames KT1 2EE, United Kingdom
Article info
Article history: Received 1 July 2011; Accepted 18 July 2011; Available online 23 July 2011
Keywords: Web search services; Search engines; Query analysis; Distributed search engines; Proxy server logs
Abstract
This study presents an analysis of users' queries directed at different search engines to investigate trends and
suggest better search engine capabilities. The query distribution among search engines, including the
spawning of queries, the number of terms per query and query lengths, is discussed to highlight the principal
factors affecting a user's choice of search engine and to evaluate the reasons for varying query lengths.
The results could be used to develop short- to long-term business plans for search engine service providers and to
determine whether or not to opt for more focused topic-specific search offerings to gain better market share.
© 2011 Elsevier B.V. All rights reserved.
1. Introduction
The rapid growth in user access to the World Wide Web (WWW)
and its applications, together with the increase in the amount of content,
makes information retrieval difficult. Web search has therefore become a
highly critical task: users expect very short response times
and near-exact search results for the queries they specify. Current
Web search engines are built to provide answers to all query requests,
independent of the special needs of any individual user. Much of this is
done by search engines prompting or directing the user to select
websites. Nowadays, however, search engines attempt to identify
some of the user's intentions and suggest more precise or relevant key
terms. Although they have improved a great deal over the years,
their results are still far from perfect.
In submitting a search query, users reveal private information about
their current interests. Analysis of this
information enables search service providers to target their search
features and capabilities more precisely to users' needs.
The gap described above, and the opportunities offered by undiscovered
query patterns, motivated this research study. Its aim has been to provide
statistics on numerous aspects of user query behaviour, the distribution
of queries over time and changing trends in user behaviour, in order to investigate
the problem of how to answer queries efficiently in the current
competitive search engine marketplace.
The analysis provided in this study was carried out in the context of a
distributed search system for the Internet developed by the Adaptive
Distributed Search and Advertising (ADSA) research project [1]. The project
forms part of the advances in Web systems and Web robots/crawlers and aims to
design advanced distributed search engines offering high-quality focused
topic-specific document databases [2]. From a top-level architectural
viewpoint, an ADSA system is a collection of components – Search
Engines and Brokers – dispersed across the Internet as shown in Fig. 1,
with the following most prominent properties:
• The system supports both document search and placement of
advertisements for the purpose of revenue generation.
• Search engines are designed to be topic-specific in order to improve
the system's focused target query handling and scalability. Therefore,
a number of distributed focused Web robots form a key part of the
ADSA system.
• An attribute-value based search facility gives users access to document
structure when making search queries.
• Each ADSA system can be independently owned and managed
autonomously in a federated cooperative yet competitive business
environment.
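The properties listed above can be illustrated with a minimal sketch. It assumes a broker that forwards each query to the engines registered for a topic, and an attribute-value filter applied to documents stored as simple dictionaries. The class names `SearchEngine` and `Broker`, the `route` method and the sample documents are hypothetical and are not taken from the ADSA project; the real system's interfaces may differ substantially.

```python
from dataclasses import dataclass, field


@dataclass
class SearchEngine:
    """Hypothetical topic-specific engine with a tiny in-memory document store."""
    topic: str
    # Each document is a dict of attribute -> value (illustrative structure only).
    documents: list = field(default_factory=list)

    def search(self, terms, attributes=None):
        """Return documents containing every term and matching every attribute value."""
        attributes = attributes or {}
        hits = []
        for doc in self.documents:
            text = " ".join(str(v) for v in doc.values()).lower()
            if not all(t.lower() in text for t in terms):
                continue
            if all(str(doc.get(k, "")).lower() == str(v).lower()
                   for k, v in attributes.items()):
                hits.append(doc)
        return hits


@dataclass
class Broker:
    """Hypothetical broker that forwards a query to the engines for one topic."""
    engines: list = field(default_factory=list)

    def route(self, topic, terms, attributes=None):
        results = []
        for engine in self.engines:
            if engine.topic == topic:
                results.extend(engine.search(terms, attributes))
        return results


# Made-up sample data, for illustration only.
medicine = SearchEngine("medicine", [
    {"title": "Flu vaccine trial", "author": "Lee", "body": "vaccine efficacy results"},
    {"title": "Sleep and memory", "author": "Khan", "body": "sleep study"},
])
sport = SearchEngine("sport", [
    {"title": "Marathon training", "author": "Lee", "body": "running plan"},
])
broker = Broker([medicine, sport])

by_terms = broker.route("medicine", ["vaccine"])                # free-text query
by_attribute = broker.route("medicine", [], {"author": "Khan"})  # attribute-value query
```

In this sketch only the engines for the requested topic are consulted, which is one simple way to read the scalability argument for topic-specific engines; advertisement placement and federation between independently owned systems are left out.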
In general, distributed search engine systems consist of many
search engines acting as one global search system. Each search engine
Computer Standards & Interfaces 34 (2012) 162–170
⁎ Corresponding author at: Kingston University, United Kingdom.
E-mail addresses: mona.taghavi@gmail.com (M. Taghavi),
whinchat2010@gmail.com (A. Patel), nikita.schmidt@gmail.com (N. Schmidt),
ccwills@kingston.ac.uk (C. Wills), yiqi01@gmail.com (Y. Tew).
0920-5489/$ – see front matter © 2011 Elsevier B.V. All rights reserved.
doi:10.1016/j.csi.2011.07.001