Advances in Computer Science and Information Technology (ACSIT)
Print ISSN: 2393-9907; Online ISSN: 2393-9915; Volume 2, Number 11; April-June, 2015 pp. 1-6
© Krishi Sanskriti Publications
http://www.krishisanskriti.org/acsit.html
Focused Web Crawler
Dvijesh Bhatt¹, Daiwat Amit Vyas² and Sharnil Pandya³
¹,²,³Institute of Technology, Nirma University
E-mail: ¹dvijesh.bhatt@nirmauni.ac.in, ²daiwat.vyas@nirmauni.ac.in, ³sharnil.pandya@nirmauni.ac.in
Abstract—With the rapid growth of data on the World Wide Web and the increasing number of web users across the globe, an acute need has arisen to design or improve search algorithms that effectively and efficiently retrieve the specific data required from the huge repository available. Search engines use different web crawlers for obtaining search results efficiently. Some search engines use a focused web crawler, which collects web pages that satisfy some specific property by effectively prioritizing the crawler frontier and managing the hyperlink exploration process. A focused web crawler analyzes its crawl boundary to locate the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date. The task of a focused web crawler is to nurture a collection of web documents that are focused on some topical subspace. It identifies the next most important and relevant link to follow by relying on probabilistic models to predict the relevancy of a document. Researchers across the globe have proposed various algorithms for improving the efficiency of the focused web crawler. We investigate various types of crawlers with their pros and cons, with the focused web crawler as our major focus area, and discuss future directions for improving its efficiency. This will provide a base reference for anyone who wishes to research or use the concept of the focused web crawler in their own work. The performance of a focused web crawler depends on the richness of links in the specific topic being searched by the user, and it usually relies on a general web search engine to provide starting points for the search.
Keywords: Focused Web Crawler, algorithms, World Wide Web, probabilistic models.
1. INTRODUCTION
Innovations in the fields of web technology and data mining have had a significant impact on the way web based technologies are developed. The Internet has been the most useful technology of modern times and has become the largest knowledge base and data repository. It has diverse uses in communication, research, financial transactions, entertainment, crowdsourcing, and politics, and it contributes to the professional as well as the personal development of individuals, technical and non-technical alike. People have become so acquainted with online resources that nearly everyone depends on them for day to day activities.
Search engines [6] are the most basic tools used for searching over the internet. Web search engines are usually equipped with multiple powerful web page search algorithms. But with the explosive growth of the World Wide Web, searching for information on the web is becoming an increasingly difficult task. All this poses unprecedented scaling challenges for general purpose crawlers and search engines. Major challenges, such as giving users the fastest possible access to the requested information in the most precise manner and making lighter web interfaces, are being addressed by researchers across the globe. The basic purpose of enhancing the search results for specific keywords can be achieved through a focused web crawler.
Web crawlers are one of the main components of web search engines, i.e. systems that assemble a corpus of web pages, index them, and allow users to issue queries against the index to find the web pages that match those queries. Web crawling is the process by which the system gathers pages from web resources in order to index them and support a search engine that serves user queries. The primary objective of crawling is to quickly, effectively and efficiently gather as many useful web pages as possible, together with the link structure that interconnects them, and provide the search results to the requesting user. A crawler must possess features such as robustness and scalability.
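As a rough illustration (not part of the paper itself), the crawl loop described above can be sketched in Python. The page graph and the `fetch_links` callback are placeholder assumptions standing in for real HTTP fetching and HTML link extraction:

```python
from collections import deque

# Toy in-memory "web": page -> outgoing links. This is an assumption for
# illustration; a real crawler would fetch pages over HTTP and parse HTML.
TOY_WEB = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html", "d.html"],
    "c.html": [],
    "d.html": ["a.html"],
}

def crawl(seed, fetch_links, max_pages=100):
    """Basic crawl loop: pop a URL from the frontier, record it,
    and push its unseen out-links back onto the frontier."""
    frontier = deque([seed])
    seen = {seed}          # avoids re-crawling duplicate links
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        crawled.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return crawled

print(crawl("a.html", lambda u: TOY_WEB.get(u, [])))
```

A production crawler would additionally respect robots.txt, throttle requests per host, and persist the frontier, but the gather-index-follow cycle above is the core.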
The first generation of crawlers, on which most search engines are based, relies heavily on traditional graph algorithms such as breadth-first search and depth-first search to index the web. In the NetCraft Web Server survey, the Web is measured in the number of websites, which grew from a small number in August 1995 to over 1 billion in April 2014. Due to the vast expansion of the Web and the inherently limited resources of a search engine, no single search engine is able to index more than one-third of the entire Web [3]. This is the primary reason for the poor performance of general purpose web crawlers.
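To make the contrast between the two traversal strategies concrete (an illustrative sketch, not from the paper; the link graph is a toy assumption), note that the only difference lies in the frontier discipline: a FIFO queue gives breadth-first order, a LIFO stack gives depth-first order:

```python
from collections import deque

# Toy link graph, assumed for illustration only.
LINKS = {
    "root": ["p1", "p2"],
    "p1": ["p1a", "p1b"],
    "p2": ["p2a"],
}

def traverse(seed, breadth_first=True):
    """Crawl order is decided solely by how the frontier is popped:
    popleft() (FIFO) -> breadth-first, pop() (LIFO) -> depth-first."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

print(traverse("root", breadth_first=True))   # visits layer by layer
print(traverse("root", breadth_first=False))  # plunges down one branch first
```

A focused crawler replaces this blind queue or stack with a priority ordering based on the predicted relevance of each link, which is the idea developed in the remainder of the paper.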
With the exponential increase in the number of websites, more emphasis is placed on the implementation of the focused web crawler. It is a crawling technique that