An Alternate Downloading Methodology of Webpages
Anirban Kundu¹, Alok Ranjan Pal¹, Tanay Sarkar¹, Moutan Banerjee¹, Subhendu Mandal¹,
Rana Dattagupta² and Debajyoti Mukhopadhyay³
¹ Netaji Subhash Engineering College (West Bengal University of Technology),
West Bengal-700 152, India
{anik76in, chhaandasik, tanay.sarkar, moutanbanerjee, subhendu.mndl}@gmail.com
² Jadavpur University, West Bengal-700 032, India
rdattagupta@cse.jdvu.ac.in
³ Calcutta Business School, Diamond Harbour Road, Bishnupur, West Bengal-743 503, India
debajyoti.mukhopadhyay@gmail.com
Abstract
We propose an advanced method for downloading Webpages from the Internet. In this technique, the whole system is considered as a bundle of crawlers which are created dynamically at execution time. A crawler is the software module that interacts with the WWW to fetch one or more Webpages. The number of crawlers used depends on how many Webpages are to be downloaded, and the crawlers are generated according to the hierarchy structure of the Web server from which the data are to be retrieved. A Webpage downloader is an important tool for fetching Web documents from the Internet and thereby facilitating a Web user's knowledge gathering. Downloaders of this type are very popular in the Information Technology field. All public data, accessible throughout the world without any authentication, can be retrieved at any time and from any geographic location using such a downloading methodology. Typically, a downloading technique accumulates Webpages of different domains on a single machine one at a time. Our aim in this paper is therefore to present an advanced technique for downloading many related Webpages with minimum effort and time, using a Hierarchical Downloader consisting of several dynamic crawlers.
Keywords: Multi-downloading, Hierarchical downloading.
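To make the idea concrete before the detailed discussion, the following Python sketch illustrates (under our own assumptions, not as the paper's actual implementation) how one lightweight crawler per Webpage can be created dynamically for each level of a server's hierarchy; the names fetch and crawl_level, the timeout, and the pool cap are illustrative choices only.

import urllib.request
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    """One 'crawler': download a single Webpage and return its HTML."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def crawl_level(urls):
    """Download one level of a site's hierarchy, one crawler per page."""
    if not urls:
        return []
    # The number of concurrent crawlers is decided at execution time by
    # how many pages the current level of the hierarchy exposes.
    with ThreadPoolExecutor(max_workers=min(16, len(urls))) as pool:
        return list(pool.map(fetch, urls))

Creating the worker pool only at call time, and sizing it by the number of pages found at the current level, mirrors the dynamic, hierarchy-driven creation of crawlers described above.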
1 Introduction
In recent years, due to the enormous growth of the World Wide
Web (WWW), it has become important to perform downloading
operations efficiently in terms of information retrieval [1].
The world at present generates approximately one to two exabytes
of unique information each year, which translates to about 250
megabytes for every man, woman and child on Earth (an exabyte is
a billion gigabytes). The World Wide Web Worm (WWWW), one of the
first Web search engines, was essentially a repository of a huge
volume of information [2]. With the advent of the WWW, users now
try to propagate information to a much wider audience far more
quickly. In this era of information technology, anybody who
wishes to gather information on a topic can find a wealth of
related data through the WWW, from any location in the world,
using a Web browser [3]. A Web browser helps people reach the
desired information with ease, almost instantaneously, over the
Internet. In a practical scenario, however, a typical Web browser
retrieves Webpages one at a time. To collect the overall
information on a particular topic, one has to check every link
available on a Webpage and download the linked pages one after
another [4]. For example, if a user wishes to read a tutorial on
a specific subject, all the hyperlinks have to be checked on a
trial-and-error basis. All such information is reachable from a
Webpage only through its URLs [5]: a typical Webpage consists of
a set of URLs that possibly contain the sought information.
Retrieving the complete information with the available methods
therefore takes a long time.
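As a hypothetical illustration of this problem (the regular-expression link pattern and the name list_links are our own, not part of the paper), the following Python sketch shows how a downloader would enumerate all the URLs embedded in a single Webpage automatically, instead of leaving the user to follow them one by one on a trial-and-error basis.

import re
import urllib.request
from urllib.parse import urljoin

# Naive href pattern; a real downloader would use a proper HTML parser.
HREF_RE = re.compile(r"""href=["']([^"']+)["']""", re.IGNORECASE)

def list_links(page_url):
    """Return the absolute URLs referenced by a single Webpage."""
    with urllib.request.urlopen(page_url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    # Resolve relative references against the page's own address.
    return [urljoin(page_url, href) for href in HREF_RE.findall(html)]

Feeding such a list of extracted URLs to a set of crawlers, rather than opening each link manually, is the starting point for the hierarchical downloading technique developed in this paper.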