Comprehensive Review of Web Focused Crawling

Promila Devi #1, Ravinder Thakur *2
# M.Tech Scholar, Department of Computer Science & Engineering, LRIET Solan, HPTU Hamirpur, India
* Assistant Professor, Department of Computer Science & Engineering, LRIET Solan, HPTU Hamirpur, India

Abstract— Finding useful information on the Web, with its huge and widely distributed structure, requires efficient search techniques. The distributed and changing nature of Web resources is a major issue for search engines, which must re-crawl the Web at fixed intervals to keep their index of Web content up to date. A focused crawler is a specific type of crawler that analyzes its crawl boundary to find the links that are likely to be relevant to the crawl while avoiding undesired areas of the Web. Many types of crawlers with different crawling strategies have been suggested. To decide which pages to follow, a focused crawler relies on a classification algorithm. In this paper we review the classification algorithms used in focused crawlers. These algorithms may be based on page contents, on semantic classification, or on both.

Keywords— Focused Crawler, Web Crawler, Search engine, Relevancy prediction.

I. INTRODUCTION

The rapid growth of the World-Wide Web poses unpredictable challenges for general-purpose crawlers and search engines. A focused crawler or topical crawler is a web crawler that attempts to download only web pages that are related to a pre-defined topic or given set of topics. Topical crawling generally assumes that only the topic is given, while focused crawling also assumes that some labeled examples of relevant and non-relevant pages are available. A focused crawler may thus be described as a crawler which returns relevant web pages on a given topic from the web.
There are a number of issues with existing focused crawlers, in particular the ability to "tunnel" through lowly ranked pages in the search path to reach highly ranked pages related to the topic further down the path. A focused crawler has two main components: (a) a way to determine whether a particular web page is relevant to the given topic, and (b) a way to determine how to proceed from a known set of pages. An early search engine which deployed a focused crawling strategy was proposed in [1], based on the intuition that relevant pages often contain relevant links. It searches deeper when relevant pages are found, and stops searching at pages less relevant to the topic. Unfortunately, such crawlers show an important drawback when the pages about a topic are not directly connected, in which case the crawling might stop prematurely. This problem is tackled in [3], where reinforcement learning permits credit assignment during the search process, thereby allowing off-topic pages to be included in the search path. However, this approach requires a large number of training examples, and the method can only be trained offline. In [2], a set of classifiers is trained on examples to estimate the distance of the current page from the closest on-topic page, but the training procedure is quite complex. Our focused crawler aims at providing a simpler alternative for overcoming the problem of intermediate pages that are lowly ranked with respect to the topic at hand. The idea is to recursively execute an exhaustive search up to a given depth, starting from the "relatives" of a highly ranked page. Hence, a set of candidate pages is obtained by retrieving pages reachable within a given perimeter from a set of initial seeds. From this set of candidate pages, we select the page with the best score with respect to the topic at hand.
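The perimeter-based candidate selection described above can be sketched as a bounded-depth expansion followed by a best-score pick. This is only an illustrative sketch, not the authors' implementation: the helpers `fetch_links` (returns the out-links of a page) and `score` (topic relevance of a page) are hypothetical placeholders for whatever link extractor and classifier the crawler uses.

```python
from collections import deque

def expand_candidates(seeds, fetch_links, max_depth):
    """Collect all pages reachable within max_depth hops of the seeds,
    i.e. the 'perimeter' of candidate pages described in the text."""
    seen = set(seeds)
    frontier = deque((url, 0) for url in seeds)
    while frontier:
        url, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the perimeter
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return seen

def crawl_step(seeds, fetch_links, score, max_depth):
    """One crawl iteration: expand the perimeter around the seeds,
    then pick the best-scoring candidate to continue crawling from."""
    candidates = expand_candidates(seeds, fetch_links, max_depth)
    return max(candidates, key=score)
```

Because the expansion is exhaustive within the perimeter, lowly ranked intermediate pages are still traversed, which is what lets the crawler "tunnel" to a highly ranked page a few hops away.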
This page and its "relatives" are inserted into the set of pages from which to continue the crawling process. Our assumption is that an "ancestor" with a good reference is likely to have other useful references among its descendants further down the lineage, even if the immediate scores of web pages close to the ancestor are low. We define a degree of relatedness with respect to the page with the best score. If this degree is large, we include more distant "cousins" into the set of seeds, further and further away from the highest-scored page. This device overcomes the difficulty of credit assignment in reinforcement learning without the burden of solving a dynamic programming problem. These ideas may be considered an extension of [1, 2]: the degree of relatedness extends the concept of child pages in [1] while avoiding the complex issue of inheritance of scores, and the perimeter is similar to the "layer" concept used in [2].

II. RELATED WORK

Focused crawling was first introduced by Chakrabarti in 1999 [4]. One of the first web crawlers was proposed by Cho et al. [5], who introduced a best-first strategy. Fish-Search [1] is an example of an early crawler that prioritizes unvisited URLs in a queue for a specific search goal. The Fish-Search approach assigns binary priority values (1 or 0) to candidate pages using simple keyword matching. One of its disadvantages is that all relevant pages receive the same priority value of 1. Shark-Search [6] is a modified version of Fish-Search in which the Vector Space Model (VSM) is used, and the priority values (continuous rather than just 1 and 0) are computed based on the priority values of parent pages, page content, and

Promila Devi et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (5), 2014, 6035-6038 www.ijcsit.com 6035
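The contrast between Fish-Search's binary keyword matching and Shark-Search's continuous VSM scores can be illustrated with a small sketch. This is a simplified illustration, not the published Shark-Search algorithm: the term-frequency cosine similarity and the `decay` parameter (which passes a fraction of the parent's score down to its children) are assumptions made for this example.

```python
import math
from collections import Counter

def cosine_sim(text, query):
    """Vector Space Model relevance: cosine similarity between
    term-frequency vectors of the page text and the query."""
    a, b = Counter(text.lower().split()), Counter(query.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def shark_priority(parent_score, page_text, query, decay=0.5):
    """Shark-Search-style continuous priority (simplified): a page keeps
    either its own VSM relevance or a decayed share of its parent's
    score, so children of relevant parents are not dropped outright."""
    return max(cosine_sim(page_text, query), decay * parent_score)
```

Under Fish-Search, a page with no keyword match gets priority 0 regardless of context; under this Shark-Search-style scheme, the same page inherits a decayed, non-zero priority when its parent is relevant.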