A New Approach to Design Domain Specific Ontology Based Web Crawler
Debajyoti Mukhopadhyay, Arup Biswas, Sukanta Sinha
Web Intelligence & Distributed Computing Research Lab, Techno India Group
West Bengal University of Technology
EM 4/1, Salt Lake Sector V, Calcutta 700091, India
{debajyoti.mukhopadhyay, biswas.arup, sukantasinha2003}@gmail.com
Abstract
A domain specific Web search engine is a search engine
which replies to domain specific user queries. The crawler
in a domain specific search engine must crawl through the
domain specific Web pages in the World Wide Web
(WWW). Downloading only the domain specific Web pages is
not an easy task for a crawler. An ontology can play a vital
role in this context. Our focus is to identify the Web
pages that belong to a particular domain in the WWW.
1. Introduction
A search engine is a document retrieval system which
helps find information stored in a computer system, such
as in the World Wide Web (WWW), inside a corporate or
proprietary network, or in a personal computer. Surveys
indicate that almost 25% of Web searchers are unable to
find useful results in the first set of URLs that are
returned. Ontology [1] is a long-established term in fields
such as Knowledge Representation and Information Modeling.
Typically an ontology is a hierarchical data structure
containing the relevant entities, relationships and rules
within a specific domain. Tom R. Gruber [2] defines an
ontology as a specification of a conceptualization.
2. Web Search Crawling
A standard crawler visits all pages using a breadth first
strategy. If we only want the pages of one domain, this is
very inefficient, because pages outside the domain are
downloaded as well. Figure 1 shows the crawling activity
of a standard crawler.
Fig.1. Standard Crawling
A crawler that crawls only through domain specific pages
is a focused crawler. As Figure 2 shows, a focused crawler
crawls through the domain specific pages and does not
consider the pages that are unrelated to the domain.
Fig.2. Focused (Domain Specific) Crawling
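The difference between the two crawling styles can be sketched on a toy in-memory link graph. This is only an illustration under stated assumptions: the graph TOY_WEB, the keyword test mentions_university_terms, and the choice not to follow links from discarded pages are all hypothetical and are not taken from the paper.

```python
from collections import deque

# Hypothetical toy Web: URL -> (page text, outgoing links).
TOY_WEB = {
    "a": ("university faculty and students", ["b", "c"]),
    "b": ("football scores", ["d"]),
    "c": ("professor publications", ["e"]),
    "d": ("more football", []),
    "e": ("student admission", []),
}


def crawl(seed, is_relevant=None):
    """Breadth first crawl from seed; return the kept URLs in visit order.

    With is_relevant=None every reachable page is kept (standard crawling).
    Otherwise a page that fails the test is discarded and, as an assumption
    of this sketch, its outgoing links are not followed (focused crawling).
    """
    kept = []
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        text, links = TOY_WEB[url]
        if is_relevant is not None and not is_relevant(text):
            continue  # page is outside the domain: discard it
        kept.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return kept


def mentions_university_terms(text):
    # Hypothetical relevance test: a plain keyword check.
    terms = {"university", "professor", "student", "students"}
    return any(word in terms for word in text.split())


standard_pages = crawl("a")
focused_pages = crawl("a", mentions_university_terms)
```

On this toy graph the standard crawl keeps all five pages, while the focused crawl keeps only the pages that pass the keyword test and are reachable through other kept pages.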
3. Our Approach
In our approach we crawl through the Web, add the Web
pages that are related to a specific domain (i.e. a specific
ontology) to the database, and discard the Web pages that
are not related to the domain. In this section we show how
to determine whether a page is domain specific.
3.1 Relevance Calculation
In this section we describe our algorithm for
calculating the relevance of a Web page to a specific
domain.
3.1.1 Weight Table. We assign a weight to each term in
the ontology. The strategy is that a more specific term
gets a higher weight, while a term that is common to more
than one domain gets a lower weight. A sample weight
table for some terms of a given ontology is shown below:
Ontology terms        Weight
Assistant Professor   1.0
Assistant             0.6
Student               0.4
Worker                0.1
Publication           0.1
Fig.3. Weight table for the above ontology
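One way such a weight table might be used is sketched below. The weights are the ones listed in Fig.3, but the scoring rule (sum of weight times number of occurrences), the helper names count_term, relevance_score and is_domain_page, and the threshold value are assumptions made for illustration; the text above does not specify them.

```python
import re

# Weights copied from the sample weight table (Fig.3), keys lowercased.
WEIGHT_TABLE = {
    "assistant professor": 1.0,
    "assistant": 0.6,
    "student": 0.4,
    "worker": 0.1,
    "publication": 0.1,
}


def count_term(term, text):
    """Count whole-word, case-insensitive occurrences of term in text."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text.lower()))


def relevance_score(text, weights=WEIGHT_TABLE):
    """Assumed scoring rule: sum of weight * occurrence count over all terms.

    Note that "assistant" is also counted inside "assistant professor",
    so overlapping terms contribute twice in this simple sketch.
    """
    return sum(weight * count_term(term, text) for term, weight in weights.items())


def is_domain_page(text, threshold=1.0):
    # The threshold is a hypothetical cut-off, not a value from the paper.
    return relevance_score(text, WEIGHT_TABLE) >= threshold


sample_score = relevance_score("The assistant professor met a student.")
off_topic = is_domain_page("Football results for this weekend")
```

Because specific terms carry higher weights, a page mentioning "Assistant Professor" once contributes more to the score than a page mentioning "Worker" several times, which matches the weighting strategy described above.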
10th International Conference on Information Technology
0-7695-3068-0/07 $25.00 © 2007 IEEE
DOI 10.1109/ICIT.2007.20