International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 04 Issue: 02 | Feb-2017 www.irjet.net p-ISSN: 2395-0072
© 2017, IRJET | Impact Factor value: 5.181 | ISO 9001:2008 Certified Journal | Page 128
WEB CRAWLER FOR MINING WEB DATA
S. AMUDHA, B.Sc., M.Sc., M.Phil.,
Assistant Professor, VLB Janakiammal College of Arts and Science,
Tamilnadu, India
amudhajaya@gmail.com
-----------------------------------------------------------------********---------------------------------------------------------------------
ABSTRACT
A web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. Web crawlers are used by full-text search engines to assist users in navigating the web. Web crawling is an important method for collecting data on, and keeping up with, the rapidly expanding Internet. Users locate resources by following hypertext links. A vast number of web pages are added every day, and existing information changes constantly. Search engines are used to extract valuable information from the Internet, and the web crawler is their principal component. This paper presents an overview of various types of web crawlers and of crawling policies such as selection, revisit, politeness, and parallelization.
Key Words: Web Crawler, World Wide Web, Search Engine, Hyperlink, Uniform Resource Locator.
1. INTRODUCTION
A web crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks on each page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies. If the crawler is archiving websites, it copies and saves the information as it goes. The archives are typically stored in such a way that they can be viewed, read, and navigated as they were on the live web, but are preserved as 'snapshots'.
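The seed-and-frontier loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production crawler: `fetch_links` is a hypothetical helper assumed to download a page and return the hyperlink URLs found on it.

```python
from collections import deque
from urllib.parse import urljoin

def crawl(seeds, fetch_links, max_pages=100):
    """Visit pages breadth-first starting from the seed URLs.

    `fetch_links(url)` is assumed to download `url` and return the
    hyperlinks found on that page (a hypothetical helper).
    """
    frontier = deque(seeds)   # the crawl frontier: URLs waiting to be visited
    visited = set(seeds)      # URLs already scheduled, to avoid re-crawling
    pages = []                # order in which pages were crawled
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        pages.append(url)
        for link in fetch_links(url):
            absolute = urljoin(url, link)   # resolve relative links
            if absolute not in visited:
                visited.add(absolute)
                frontier.append(absolute)
    return pages
```

Using a queue (`deque`) gives breadth-first traversal; swapping it for a priority queue is one way to implement the selection policies discussed later.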
The large volume of the web means a crawler can download only a limited number of pages within a given time, so it needs to prioritize its downloads. The high rate of change means that pages may already have been updated or even deleted by the time the crawler revisits them. The number of possible URLs generated by server-side software also makes it difficult for web crawlers to avoid retrieving duplicate content: endless combinations of HTTP GET (URL-based) parameters exist, of which only a small selection actually return unique content. For example, a simple online photo gallery may offer four options to users, specified through HTTP GET parameters in the URL. If there are four ways to sort images, three choices of thumbnail size, two file formats, and an option to disable user-provided content, then the same set of content can be accessed through 4 × 3 × 2 × 2 = 48 different URLs, all of which may be linked on the site. This combinatorial explosion creates a problem for crawlers, which must sort through endless combinations of relatively minor scripted changes in order to retrieve unique content. The World Wide Web has grown from a few thousand pages in 1993 to more than two billion pages at present. Contributing factors to this explosive growth are the widespread use of microcomputers, the increased ease of use of software packages, and, most importantly, the tremendous opportunities that the web offers to business. New tools and techniques are crucial for intelligently searching for useful information on the web [10].
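The arithmetic behind the 48-URL gallery example can be checked with a short Python sketch. The parameter names and the `example.com` URL are illustrative assumptions, not taken from any real gallery:

```python
from itertools import product

# Hypothetical gallery parameters matching the example in the text.
sort_orders     = ["name", "date", "size", "rating"]   # 4 ways to sort
thumbnail_sizes = ["small", "medium", "large"]         # 3 thumbnail sizes
file_formats    = ["jpg", "png"]                       # 2 file formats
user_content    = ["on", "off"]                        # toggle for user content

# Every combination of parameter values yields a distinct URL,
# even though each one serves the same underlying set of images.
urls = [
    f"http://example.com/gallery?sort={s}&thumb={t}&fmt={f}&user={u}"
    for s, t, f, u in product(sort_orders, thumbnail_sizes,
                              file_formats, user_content)
]
print(len(urls))  # 4 * 3 * 2 * 2 = 48 distinct URLs
```

A crawler that treats each of these URLs as new content fetches the same gallery 48 times, which is why duplicate-URL normalization matters.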
Web crawling is an important method for collecting data and keeping up to date with the rapidly expanding Internet. A web crawler is a program that automatically traverses the web by downloading documents and following links from page to page. It is a tool for search engines and other information seekers to gather data for indexing and to keep their databases up to date. All search engines internally use web crawlers to keep their copies of the data fresh. A search engine is divided into different modules; among these, the crawler module is the one on which the search engine relies the most, because it helps provide the best possible results. Crawlers are small programs that 'browse' the web