A New Approach to Design Domain Specific Ontology Based Web Crawler

Debajyoti Mukhopadhyay, Arup Biswas, Sukanta Sinha
Web Intelligence & Distributed Computing Research Lab, Techno India Group
West Bengal University of Technology
EM 4/1, Salt Lake Sector V, Calcutta 700091, India
{debajyoti.mukhopadhyay, biswas.arup, sukantasinha2003}@gmail.com

Abstract

A domain specific Web search engine is a search engine that answers domain specific user queries. The crawler in a domain specific search engine must crawl through the domain specific Web pages in the World Wide Web (WWW). Downloading only the domain specific Web pages is not an easy task for a crawler. Ontology can play a vital role in this context. Our focus is to identify the Web pages that belong to a particular domain in the WWW.

1. Introduction

A search engine is a document retrieval system that helps find information stored in a computer system, such as the World Wide Web (WWW), a corporate or proprietary network, or a personal computer. Surveys indicate that almost 25% of Web searchers are unable to find useful results in the first set of URLs that are returned.

The term ontology [1] is an old term used in the fields of Knowledge Representation, Information Modeling, etc. Typically, an ontology is a hierarchical data structure containing the relevant entities, relationships and rules within a specific domain. Tom R. Gruber [2] defines ontology as a specification of a conceptualization.

2. Web Search Crawling

A standard crawler crawls through all pages using a breadth-first strategy, so it is very inefficient when we only want to crawl pages of one particular domain. Figure 1 shows the crawling activity of a general crawler.

Fig.1. Standard Crawling

A crawler that crawls only through domain specific pages is a focused crawler. Figure 2 shows that a focused crawler crawls through domain specific pages; pages that are not related to the particular domain are not considered.

Fig.2.
Focused (Domain Specific) Crawling

3. Our Approach

In our approach we crawl through the Web and add to the database the Web pages that are related to a specific domain (i.e. a specific ontology), and we discard the Web pages that are not related to the domain. In this section we show how to determine whether a page is domain specific.

3.1 Relevance Calculation

In this section we describe our own algorithm for calculating the relevancy of a Web page to a specific domain.

3.1.1 Weight Table. We want to assign a weight to each term in the ontology. The strategy is that more specific terms receive more weight, and terms that are common to more than one domain receive less weight. A sample weight table for some terms of a given ontology is shown below:

Ontology terms         Weight
Assistant Professor    1.0
Assistant              0.6
Student                0.4
Worker                 0.1
Publication            0.1

Fig.3. Weight table for the above ontology

10th International Conference on Information Technology 0-7695-3068-0/07 $25.00 © 2007 IEEE DOI 10.1109/ICIT.2007.20
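
To make the role of the weight table concrete, the following is a minimal Python sketch that stores the sample weights from Fig. 3 in a dictionary and scores a piece of page text against them. The scoring rule used here (sum of weight times number of occurrences) is an assumption made for illustration only, since the text above does not define the relevance formula. The names WEIGHT_TABLE, count_terms and relevance_score are hypothetical, and the choice to match longer terms first, so that "Assistant Professor" is not also counted as "Assistant", is likewise an assumption of this sketch.

```python
import re

# Sample weights copied from the weight table in Fig. 3 (keys lower-cased).
WEIGHT_TABLE = {
    "assistant professor": 1.0,
    "assistant": 0.6,
    "student": 0.4,
    "worker": 0.1,
    "publication": 0.1,
}


def count_terms(text, weight_table):
    """Count whole-word occurrences of each ontology term in the text.

    Longer terms are matched first and their matches are blanked out, so a
    multi-word term is not counted again as one of its shorter sub-terms.
    (This matching policy is an assumption, not taken from the paper.)
    """
    remaining = text.lower()
    counts = {}
    for term in sorted(weight_table, key=len, reverse=True):
        pattern = r"\b" + re.escape(term) + r"\b"
        counts[term] = len(re.findall(pattern, remaining))
        remaining = re.sub(pattern, " ", remaining)
    return counts


def relevance_score(text, weight_table=WEIGHT_TABLE):
    """Assumed scoring rule: sum of (term weight * term occurrences)."""
    counts = count_terms(text, weight_table)
    return sum(weight_table[term] * n for term, n in counts.items())


if __name__ == "__main__":
    page_text = "The Assistant Professor met a student and an assistant."
    print(count_terms(page_text, WEIGHT_TABLE))
    print(relevance_score(page_text))
```

Under this assumed rule, a page that mentions specific terms such as "Assistant Professor" scores higher than one that only mentions terms shared with many domains such as "Worker", which matches the weighting strategy described in Section 3.1.1.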