An analysis of web proxy logs with query distribution pattern approach for search engines

Mona Taghavi a, Ahmed Patel b,c,*, Nikita Schmidt b, Christopher Wills c, Yiqi Tew b

a Department of Computer, Science and Research Branch, Islamic Azad University, Tehran, Iran
b Department of Computer Science, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia (The National University of Malaysia), 43600 Bangi, Selangor Darul Ehsan, Malaysia
c Centre for Applied Research in Information Systems, Faculty of Computing Information Systems and Mathematics, Kingston University, Penrhyn Road, Kingston upon Thames KT1 2EE, United Kingdom

Article info

Article history: Received 1 July 2011; Accepted 18 July 2011; Available online 23 July 2011

Keywords: Web search services; Search engines; Query analysis; Distributed search engines; Proxy server logs

Abstract

This study presents an analysis of users' queries directed at different search engines to investigate trends and suggest better search engine capabilities. The query distribution among search engines, including the spawning of queries, the number of terms per query and query lengths, is discussed to highlight the principal factors affecting a user's choice of search engine and to evaluate the reasons for variation in query length. The results could be used to develop long- to short-term business plans for search engine service providers, helping them determine whether or not to opt for more focused, topic-specific search offerings to gain a better market share.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

The rapid growth in user access to the World Wide Web (WWW) and its applications, together with the increase in the amount of content, creates difficulty for information retrieval. This makes the task of Web search highly critical, given the demand for very short response times and near-exact search results for user-specified queries.
Current Web search engines are built to provide answers to all query requests, independent of the special needs of any individual user. Much of this is done by search engines prompting or directing the user to select websites. Nowadays, however, search engines attempt to identify some of the user's intentions and suggest more precise or relevant key terms. Although they have improved a great deal over the years, their results are still far from perfect. In fact, users reveal private information about their current interests by submitting a search query. Analysis of this information enables search service providers to target their search features and capabilities more or less precisely to users' needs.

The above-mentioned gap, and the opportunities behind undiscovered query patterns, motivated this research study. Its aim has been to provide statistics on numerous aspects of user query behaviour, the distribution of queries over time and changing trends in user behaviour, in order to investigate the problem of how to answer queries efficiently in the current competitive search engine marketplace.

The analysis provided in this study was carried out in the context of a distributed search system for the Internet developed by the Adaptive Distributed Search and Advertising (ADSA) research project [1], which forms part of the advances in Web systems and Web robots/crawlers and aims to design advanced distributed search engines offering high-quality, focused, topic-specific document databases [2]. From a top-level architectural viewpoint, an ADSA system is a collection of components (Search Engines and Brokers) dispersed across the Internet, as shown in Fig. 1, with the following most prominent properties:

• The system supports both document search and placement of advertisements for the purpose of revenue generation.
• Search engines are designed to be topic-specific in order to improve the system's focused target query handling and scalability. Therefore, a number of distributed focused Web robots form a key part of the ADSA system.
• An attribute-value based search facility gives users access to document structure when making search queries.
• Each ADSA system can be independently owned and managed autonomously in a federated, cooperative yet competitive business environment.

* Corresponding author at: Kingston University, United Kingdom.
E-mail addresses: mona.taghavi@gmail.com (M. Taghavi), whinchat2010@gmail.com (A. Patel), nikita.schmidt@gmail.com (N. Schmidt), ccwills@kingston.ac.uk (C. Wills), yiqi01@gmail.com (Y. Tew).
Computer Standards & Interfaces 34 (2012) 162–170. doi:10.1016/j.csi.2011.07.001

In general, distributed search engine systems consist of many search engines acting as one global search system. Each search engine