International Journal of Computer Applications (0975 – 8887)
Volume 40 – No. 3, February 2012

Context based Web Indexing for Storage of Relevant Web Pages

Nidhi Tyagi
Asst. Prof., Shobhit University, Meerut

Rahul Rishi
Prof. & Head, Technical Institute of Textile & Sciences, Bhiwani

R.P. Agarwal
Prof. & Head, Shobhit University, Meerut

ABSTRACT
A focused crawler downloads web pages that are relevant to a user-specified topic. The downloaded documents are indexed with a view to optimizing speed and performance in finding relevant documents for a search query at the search-engine side. However, the retrieved information is more relevant if the context of the topic is also made available to the retrieval system. This paper proposes a technique for indexing the keywords extracted from web documents along with their contexts, using a height-balanced binary search (AVL) tree as the index structure to enhance the performance of the retrieval system.

General Terms
Algorithm, retrieval, indexer.

Keywords
AVL tree, contextual, repository, balance factor.

1. INTRODUCTION
With the rapid growth of the Internet, the World Wide Web (WWW) has become one of the most important resources for obtaining information and one of the most important media of communication. The basic aim is to select the best collection of information according to the user's need. Existing focused crawlers [1, 2] adopt different strategies for computing word frequencies in web documents: if the higher-frequency words match the topic keyword, the document is considered relevant. However, these crawlers generally do not analyze the context of the keyword in a web page before downloading it. The subject of context has received a great deal of attention in the information retrieval literature on seeking relevant information [9]. Exploring the contents of web pages for automatic indexing is therefore of fundamental importance for various web applications.
The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query. Without an index, the search engine would have to scan every document in the corpus, which would require considerable time and computing power. For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in those documents is a time-consuming task. The additional storage required to hold the index, as well as the considerable increase in the time required for an update, are traded off against the time saved during information retrieval.

In an AVL tree (i.e. a height-balanced binary search tree) [3], the height of a tree is defined as the length of the longest path from the root node to one of its leaf nodes, and the balance factor (BF) of a node is: (height of left subtree – height of right subtree). For the tree to qualify as AVL-balanced, the BF of every node must be -1, 0 or 1. This property keeps search paths short and makes the searching task faster.

2. RELATED WORK
In this section, a review of previous work on index organization is given. In the field of index organization and maintenance, many algorithms and techniques have already been proposed, but they prove inefficient at accessing the index. F. Silvestri, R. Perego and S. Orlando [4] proposed a reordering algorithm that partitions the set of documents into k ordered clusters on the basis of a similarity measure: the biggest document is selected as the centroid of the first cluster, and the n/k – 1 documents most similar to it are assigned to that cluster. The biggest remaining document is then selected and the same process repeats, until all k clusters are formed and each cluster is completed with n/k documents. This algorithm is not effective at grouping the most similar documents together. Oren Zamir and Oren Etzioni [5] proposed a threshold-based clustering algorithm.
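The height and balance-factor definitions above can be illustrated with a small sketch. The node class, the helper names, and the sample keyword tree below are illustrative assumptions, not taken from the paper; the BF formula and the -1/0/1 balance condition are exactly as defined above.

```python
# Minimal sketch of the AVL balance-factor check described above.
# The sample keywords are hypothetical; only the definitions of
# height and BF follow the text.

class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def height(node):
    """Length of the longest path from `node` down to a leaf.
    An empty subtree is given height -1 so that a leaf has height 0."""
    if node is None:
        return -1
    return 1 + max(height(node.left), height(node.right))

def balance_factor(node):
    """BF = height of left subtree - height of right subtree."""
    return height(node.left) - height(node.right)

def is_avl(node):
    """A tree is height-balanced (AVL) iff every node has BF in {-1, 0, 1}."""
    if node is None:
        return True
    return (abs(balance_factor(node)) <= 1
            and is_avl(node.left) and is_avl(node.right))

# A small keyword index shaped as a balanced binary search tree:
#         "index"
#         /     \
#   "crawler"  "search"
root = Node("index", Node("crawler"), Node("search"))
print(balance_factor(root))  # 0
print(is_avl(root))          # True

# Chaining two more nodes under the left child unbalances the root:
root.left.left = Node("avl")
root.left.left.left = Node("agent")
print(balance_factor(root))  # 2  -> a real AVL tree would now rotate
print(is_avl(root))          # False
```

In a full AVL implementation, the moment a node's BF leaves {-1, 0, 1} (as in the last step), a rotation restores balance, which is what bounds the tree height and keeps lookups fast.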
In their approach, the number of clusters is initially unknown; the similarity between two documents is evaluated against a specified threshold, and sufficiently similar documents are assigned to the same cluster. If the threshold is small, the elements are spread across many different clusters; if the threshold is large, the elements may all be assigned to a single cluster. The algorithm is therefore sensitive to the choice of threshold. C. Zhou, W. Ding and N. Yang [6] introduced a double indexing mechanism for a campus-network search engine (CNSE), which consists of a crawl machine, Chinese automatic segmentation, an index machine and a search machine. The proposed mechanism maintains a document index as well as a word index. In the document index, entries are grouped by document and ordered by the position of each word within that document. During retrieval, the search engine first obtains the document id of a word from the word index, and then moves to the position of the corresponding word in the document index. Because the words of the same document are adjacent in the document index, the search engine can directly compare the largest matching word assembly with the sentence that the user submits. This mechanism, however, appears time-consuming because the index exists at two levels. N. Chauhan and A. K. Sharma [7] proposed the context driven focused crawler (CDFC), which searches and downloads only highly relevant web pages, thus reducing network traffic. A category tree provides flexibility for the user to interact with the system by showing the broad categories of topics on the web. The proposed