Available Online at www.ijcsmc.com
International Journal of Computer Science and Mobile Computing
A Monthly Journal of Computer Science and Information Technology
ISSN 2320–088X
IJCSMC, Vol. 2, Issue. 8, August 2013, pg.243 – 247
RESEARCH ARTICLE
© 2013, IJCSMC All Rights Reserved 243
Design of Improved Web Crawler By Analysing Irrelevant Result
Prashant Dahiwale¹, Dr. M.M. Raghuwanshi², Dr. Latesh Malik³
¹Research Scholar, Dept. of Computer Science & Engineering, G.H.Raisoni College of Engineering, India
²Professor, Dept. of Comp Sc. Engg, Rajiv Gandhi College of Engineering & Research, Nagpur, India
³Professor, Dept. of Comp Sc. Engg, G.H.Raisoni College of Engineering, Nagpur, India
¹prashantdd.india@gmail.com; ²m.raghuwanshi@rediffmail.com; ³latesh.malik@raisoni.net
Abstract— A key issue in designing a focused Web crawler is how to determine whether an unvisited URL is
relevant to the search topic. Effective relevance prediction can help avoid downloading and visiting many
irrelevant pages. In this work, we propose a new learning-based approach to improve relevance prediction
in focused Web crawlers. For this study, we chose a Naïve Bayesian classifier as the base prediction model; it
can, however, easily be replaced by a different prediction model. The performance of a focused crawler
depends mostly on the richness of links in the specific topic being searched, and focused crawling usually
relies on a general web search engine for providing starting points.
Key Terms: URL; focused crawler; classifier; relevance prediction; links; search engine; ranking
I. INTRODUCTION
As the number of Internet users and the number of accessible Web pages grow, it is becoming increasingly
difficult for users to find documents that are relevant to their particular needs. Users must either browse through
a large hierarchy of archives to find the information for which they are looking or submit a query to a publicly
available search engine and wade through hundreds of results, most of them irrelevant.
Typing “Java” as a keyword into the Google search engine returns around 25 million results with quotation
marks and 237 million results without quotation marks. With the same keyword, the Yahoo search engine returns
around 8 million results with quotation marks and 139 million results without quotation marks, while the MSN
search engine returns around 8 million results with quotation marks and around 137 million results without
quotation marks. These huge numbers of results are presented to the user, but most of them are barely relevant or
of no interest to the user.
The key issue is the relevance of a web page to a specific topic. Popular search engines depend on
index databases populated by various web crawlers that collect information; the main task of a focused
crawler is therefore to classify the relevance of a new, unvisited URL.
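To make the classification task concrete, the following is a minimal sketch of a Naïve Bayesian relevance predictor, assuming that each unvisited URL is represented only by the words of its anchor text. The class name `NaiveBayesRelevance`, the `tokenize` helper, the add-one smoothing and the four training phrases are illustrative assumptions, not the implementation used in this study.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesRelevance:
    """Multinomial Naive Bayes over tokens with add-one (Laplace) smoothing."""

    def __init__(self):
        # Per-class token frequencies and number of training examples.
        self.token_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}
        self.vocabulary = set()

    def train(self, tokens, relevant):
        """Record one labelled example (relevant is True or False)."""
        self.doc_counts[relevant] += 1
        self.token_counts[relevant].update(tokens)
        self.vocabulary.update(tokens)

    def _log_score(self, tokens, label):
        total_docs = sum(self.doc_counts.values())
        # Smoothed log prior, so an unseen class never yields log(0).
        score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
        total_tokens = sum(self.token_counts[label].values())
        for token in tokens:
            if token not in self.vocabulary:
                continue  # words never seen in training carry no evidence
            numerator = self.token_counts[label][token] + 1
            denominator = total_tokens + len(self.vocabulary)
            score += math.log(numerator / denominator)
        return score

    def is_relevant(self, tokens):
        """Predict relevance by comparing the two class log-scores."""
        return self._log_score(tokens, True) > self._log_score(tokens, False)


# Hypothetical training data: anchor texts labelled for the topic "stock market".
model = NaiveBayesRelevance()
model.train(tokenize("stock market prices today"), True)
model.train(tokenize("shares and stock exchange news"), True)
model.train(tokenize("football match results"), False)
model.train(tokenize("celebrity news and gossip"), False)

follow_link = model.is_relevant(tokenize("stock exchange prices"))
skip_link = not model.is_relevant(tokenize("football gossip"))
```

In a focused crawler, a prediction of this kind would be made before a link is downloaded, so that links judged irrelevant are never fetched. Because the crawler only calls `is_relevant`, the base model could be exchanged for another classifier without changing the rest of the crawl loop.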
II. LITERATURE SURVEY
In [1], the authors use “Stock Market” as a sample topic and extend the learning-based relevance prediction model
proposed in “Intelligent Focused Crawler: Learning Which Links to Crawl” from two relevance attributes to