Advances in Computer Science and Information Technology (ACSIT)
Print ISSN: 2393-9907; Online ISSN: 2393-9915; Volume 2, Number 11; April-June, 2015 pp. 1-6
© Krishi Sanskriti Publications
http://www.krishisanskriti.org/acsit.html
Focused Web Crawler
Dvijesh Bhatt¹, Daiwat Amit Vyas² and Sharnil Pandya³
¹,²,³Institute of Technology, Nirma University
E-mail: ¹dvijesh.bhatt@nirmauni.ac.in, ²daiwat.vyas@nirmauni.ac.in, ³sharnil.pandya@nirmauni.ac.in
Abstract—With the rapid growth of data on the World Wide Web and the increasing number of web users across the globe, an acute need has arisen to design or improve search algorithms that effectively and efficiently retrieve the specific data required from the huge repository available. Search engines use different web crawlers for obtaining search results efficiently. Some search engines use a focused web crawler, which collects web pages that satisfy some specific property by effectively prioritizing the crawler frontier and managing the hyperlink exploration process. A focused web crawler analyzes its crawl boundary to locate the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date. The task of a focused web crawler is to nurture a collection of web documents that are focused on some topical subspace. It identifies the next most important and relevant link to follow by relying on probabilistic models to predict the relevancy of a document. Researchers across the globe have proposed various algorithms for improving the efficiency of the focused web crawler. We investigate various types of crawlers with their pros and cons, with the focused web crawler as our major focus area, and discuss future directions for improving its efficiency. This will provide a base reference for anyone who wishes to research or use the concept of the focused web crawler in their own work. The performance of a focused web crawler depends on the richness of links in the specific topic being searched by the user, and it usually relies on a general web search engine to provide starting points for the search.
Keywords: Focused Web Crawler, algorithms, World Wide Web, probabilistic models.
1. INTRODUCTION
Innovations in the fields of web technology and data mining have had a significant impact on the way web based technologies are developed. The Internet has been the most useful technology of modern times and has become the largest knowledge base and data repository. It has diverse uses in communication, research, financial transactions, entertainment, crowdsourcing, and politics, and it contributes to the professional as well as the personal development of individuals, technical and non-technical alike. People have become so acquainted with online resources that nearly everyone depends on them for day to day activities.
Search engines [6] are the most basic tools used for searching over the internet. Web search engines are usually equipped with multiple powerful web page search algorithms. But with the explosive growth of the World Wide Web, searching for information on the web is becoming an increasingly difficult task. All this poses unprecedented scaling challenges for general purpose crawlers and search engines. Major challenges, such as giving users the fastest possible access to the requested information in the most precise manner and making lighter web interfaces, are being addressed by researchers across the globe. The basic purpose of enhancing the search results for specific keywords can be achieved through a focused web crawler.
Web crawlers are one of the main components of web search engines, i.e. systems that assemble a corpus of web pages, index them, and allow users to issue queries against the index to find the web pages that match those queries. Web crawling is the process by which the system gathers pages from web resources in order to index them and support a search engine that serves user queries. The primary objective of crawling is to quickly, effectively and efficiently gather as many useful web pages as possible, together with the link structure that interconnects them, and provide the search results to the requesting user. A crawler must possess features such as robustness and scalability.
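As a rough illustration (not part of the paper itself), the crawl loop described above can be sketched in Python. The page graph and the `fetch_links` callback are placeholder assumptions standing in for real HTTP fetching and HTML link extraction:

```python
from collections import deque

# Toy in-memory "web": page -> outgoing links. This is an assumption for
# illustration; a real crawler would fetch pages over HTTP and parse HTML.
TOY_WEB = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html", "d.html"],
    "c.html": [],
    "d.html": ["a.html"],
}

def crawl(seed, fetch_links, max_pages=100):
    """Basic crawl loop: pop a URL from the frontier, record it,
    and push its unseen out-links back onto the frontier."""
    frontier = deque([seed])
    seen = {seed}          # avoids re-crawling duplicate links
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        crawled.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return crawled

print(crawl("a.html", lambda u: TOY_WEB.get(u, [])))
```

A production crawler would additionally respect robots.txt, throttle requests per host, and persist the frontier, but the gather-index-follow cycle above is the core.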
The first generation of crawlers, on which most search engines are based, relies heavily on traditional graph algorithms such as breadth-first search and depth-first search to index the web. In the NetCraft Web Server survey, the Web is measured in the number of websites, which grew from a small number in August 1995 to over 1 billion in April 2014. Due to the vast expansion of the Web and the inherently limited resources of a search engine, no single search engine is able to index more than one-third of the entire Web [3]. This is the primary reason for the poor performance of general purpose web crawlers.
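To make the contrast between the two traversal strategies concrete (an illustrative sketch, not from the paper; the link graph is a toy assumption), note that the only difference lies in the frontier discipline: a FIFO queue gives breadth-first order, a LIFO stack gives depth-first order:

```python
from collections import deque

# Toy link graph, assumed for illustration only.
LINKS = {
    "root": ["p1", "p2"],
    "p1": ["p1a", "p1b"],
    "p2": ["p2a"],
}

def traverse(seed, breadth_first=True):
    """Crawl order is decided solely by how the frontier is popped:
    popleft() (FIFO) -> breadth-first, pop() (LIFO) -> depth-first."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

print(traverse("root", breadth_first=True))   # visits layer by layer
print(traverse("root", breadth_first=False))  # plunges down one branch first
```

A focused crawler replaces this blind queue or stack with a priority ordering based on the predicted relevance of each link, which is the idea developed in the remainder of the paper.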
With the exponential increase in the number of websites, more emphasis is placed on the implementation of the focused web crawler. It is a crawling technique that