2nd International Conference and Workshop on Emerging Trends in Technology (ICWET) 2011
Proceedings published by International Journal of Computer Applications® (IJCA)

Focused Web Crawler with Page Change Detection Policy

Swati Mali, VJTI, Mumbai
B.B. Meshram, VJTI, Mumbai

ABSTRACT
Focused crawlers aim to search only the subset of the web related to a specific topic, and offer a potential solution to the problem of scale faced by general-purpose crawlers. The major problem is how to retrieve the maximal set of relevant, high-quality pages. In this paper, we propose an architecture that concentrates on the page selection policy and the page revisit policy. A three-step algorithm for page refreshment serves this purpose. The first layer decides page relevance using two methods. The second layer checks whether the structure of a web page has changed, whether its text content has been altered, and whether any image has changed. A minor variation of the method of prioritizing URLs by forward link count is also discussed, to accommodate update frequency. Finally, the third layer updates the URL repository.

General Terms
Algorithms, Performance, Design

Keywords
Focused crawler, page change detection, crawler policies, crawler database.

1. INTRODUCTION
A crawler is an automated script that independently browses the World Wide Web. It starts with a seed URL and then follows the links on each page in a breadth-first or depth-first manner [1]. A Web crawler searches through Web servers to find information about a particular topic. However, searching all the Web servers and all their pages is not realistic, because of the growth of the Web and its refresh rates. Traversing the Web quickly and entirely is an expensive, unrealistic goal because of the required hardware and network resources [1, 2]. Focused crawling is designed to traverse a subset of the Web to gather documents on a specific topic, and thus addresses the above problem [3].
The major problem for a focused crawler is how to identify the promising links that lead to target documents while avoiding off-topic searches. To address this problem, we use not only the content of a web page to improve page relevance but also its link structure to improve the coverage of a specific topic. Moreover, the crawler is no longer limited to simple HTML pages; it supports the whole variety of pages used to display dynamic content and ever-changing layouts.

The outline of the paper is as follows: Section 2 provides a more detailed overview of focused crawling. Section 3 describes the architecture and implementation of our approach. Comparisons with existing focused crawling algorithms on some test crawls are shown in Section 4, and we conclude by discussing extensions and implications in Section 5.

2. LITERATURE REVIEW
A focused crawler is a program that searches the Internet for information related to topics of interest. The main property of focused crawling is that the crawler does not need to collect all web pages, but selects and retrieves relevant pages only [1] [2] [3]. The design of a good crawler presents many challenges. Externally, the crawler must avoid overloading Web sites or network links as it goes about its business. Internally, the crawler must deal with huge volumes of data. Unless it has unlimited computing resources and unlimited time, it must carefully decide which URLs to scan and in what order. The crawler must also decide how frequently to revisit pages it has already seen, in order to keep itself informed of changes on the Web.

2.1 General Architecture
Roughly, a crawler starts with the URL for an initial page P0. It retrieves P0, extracts any URLs in it, and adds them to a queue of URLs to be scanned. The crawler then gets URLs from the queue (in some order) and repeats the process.
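The basic loop just described can be sketched as follows. This is a minimal illustration, not the architecture proposed in this paper: `fetch` and `extract_urls` are placeholder helpers supplied by the caller, and a real crawler would add politeness delays, relevance scoring, and robots.txt handling.

```python
from collections import deque

def crawl(seed_url, fetch, extract_urls, max_pages=100):
    """Minimal crawl loop: start from a seed URL, fetch pages,
    enqueue newly discovered URLs, and repeat."""
    frontier = deque([seed_url])   # FIFO queue -> breadth-first order
    seen = {seed_url}              # avoid enqueuing the same URL twice
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()   # frontier.pop() would give depth-first order
        page = fetch(url)
        if page is None:           # fetch failed; skip this URL
            continue
        pages[url] = page
        for link in extract_urls(page):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```

Because the frontier is a FIFO queue, URLs are visited breadth-first; swapping `popleft()` for `pop()` turns the same loop into a depth-first traversal, matching the two orders mentioned in the introduction.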
Every page that is scanned is handed off to a component that saves the pages, creates an index for them, or summarizes or analyzes their content [1] [3] [5]. The design of a basic crawler is shown in Fig. 1.

2.2 Crawling Policies
The behavior of a Web crawler is the outcome of a combination of policies [1] [3]:

A selection policy: This states which pages to download. As a crawler always downloads just a fraction of the Web pages, it is highly desirable that the downloaded fraction contain the most relevant pages and not just a random sample of the Web. Designing a good selection policy has an added difficulty: it must work with partial information, as the complete set of Web pages is not known during crawling.

A re-visit policy: This states when to check for changes to the pages. The Web has a very dynamic nature, and crawling even a fraction of it can take a long time, usually measured in weeks or months. By the time a Web crawler has finished its crawl, many events may have happened, including creations, updates, and deletions. From the search engine's point of view, there is a cost associated with not detecting such an event and thus keeping an outdated copy of a resource.
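The two policies can be illustrated with a small sketch. Both pieces below are our own illustrative assumptions, not the algorithms proposed in this paper: the relevance scores, the doubling/halving constants, and all names are hypothetical.

```python
import heapq
import itertools

class PriorityFrontier:
    """Illustrative selection policy: a priority frontier that hands out
    the URL with the highest estimated relevance first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal scores

    def push(self, url, relevance):
        # heapq is a min-heap, so negate the score to pop highest-first
        heapq.heappush(self._heap, (-relevance, next(self._counter), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

def next_revisit_interval(current_interval, changed,
                          shrink=0.5, grow=2.0,
                          min_interval=1.0, max_interval=1024.0):
    """Illustrative re-visit policy: revisit a page sooner if it changed
    since the last visit, and back off exponentially if it did not."""
    interval = current_interval * (shrink if changed else grow)
    return max(min_interval, min(interval, max_interval))
```

A frontier ordered by estimated relevance addresses the selection problem under partial information, while adapting the revisit interval to observed changes concentrates the limited crawl budget on the pages most likely to be stale.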