Clustering of Hub and Authority Web Documents for Information Retrieval

Kavita Kanathey
Computer Science, Barkatullah University, Bhopal, MP, India

R. S. Thakur
Department of Computer Application, Maulana Azad National Institute of Technology (MANIT), Bhopal, MP, India

Shailesh Jaloree
Department of Applied Mathematics, SATI, Vidisha, Bhopal, MP, India

Abstract- Due to the exponential growth of the World Wide Web (or simply the Web), finding and ranking relevant web documents has become an extremely challenging task. When a user tries to retrieve relevant, high-quality information from the Web, the ranking of the search results for the query plays an important role. Ranking provides an ordered list of web documents so that users can easily navigate the search results and find the content they need. Many ranking algorithms (PageRank, HITS, Weighted PageRank) have been proposed to rank these documents, based on factors such as citation analysis, content similarity, and annotations. However, these algorithms present the user with a set of unclassified web documents for a query. In this paper, we propose a link-based clustering approach to cluster search results returned from a link-based web search engine. After filtering out some irrelevant pages, our approach classifies the remaining web pages into most relevant, relevant, and irrelevant groups to facilitate users' access and browsing. To increase relevancy accuracy, the K-Means clustering algorithm is used. Preliminary evaluations are conducted to examine its effectiveness. The results show that clustering web search results through link analysis is promising. This paper also outlines various page ranking algorithms.

Keywords - World Wide Web, search engine, information retrieval, PageRank, HITS, Weighted PageRank, link analysis.

I. INTRODUCTION

The World Wide Web is a popular and interactive way to disseminate information nowadays.
The Web is the largest information repository for knowledge reference. It is a huge, semi-structured, dynamic, heterogeneous, and broadly distributed global information service center [5]. Finding relevant, high-quality web pages for users based on their queries is becoming increasingly difficult. Researchers can observe that most of the web documents collected by a web spider are not relevant to the user's query. It is inconvenient for the user to filter irrelevant information out of these search results, which wastes time. For these reasons, a cluster search engine provides a way to find information by returning a set of classified web pages.

An important class of search engines that offer search results based on hypertext links between sites can be termed Link Based Search Engines. Rather than providing results based on keywords or the content of the web documents, these engines rank sites by the quality and quantity of other web sites linked to them. In this system, the user submits a query to the meta-search engine, which searches for results relevant to the query. The results retrieved from the web search engine are organized into a meta-directory tree. This tree structure helps the user retrieve information with high relevancy.

The relevancy of a web page can be obtained by considering the number of in-links and out-links of that page. When a web page has a larger number of out-links to relevant pages, that page can be considered a central page. All other web pages are compared with this central page for similarity, and the most similar pages are grouped together. This grouping of similar pages is known as clustering. Clustering can be done with different algorithms, such as hierarchical, k-means, and partitioning methods. One of the simplest unsupervised learning algorithms that solve the clustering problem is the K-Means algorithm.
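As a rough illustration of the link-counting idea described above, the following minimal sketch counts in-links and out-links from a list of hyperlinks and picks a "central page" as the page with the most out-links to relevant pages. The edge list, the `relevant` set, and the function names `link_counts` and `central_page` are hypothetical and chosen only for this example; the paper does not specify how these quantities are computed.

```python
from collections import defaultdict


def link_counts(edges):
    """Return {page: (in_links, out_links)} for a list of (source, target) links."""
    in_links = defaultdict(int)
    out_links = defaultdict(int)
    pages = set()
    for source, target in edges:
        out_links[source] += 1
        in_links[target] += 1
        pages.update((source, target))
    return {page: (in_links[page], out_links[page]) for page in sorted(pages)}


def central_page(edges, relevant):
    """Pick the page with the most out-links pointing to pages in `relevant`."""
    score = defaultdict(int)
    for source, target in edges:
        if target in relevant:
            score[source] += 1
    if not score:
        return None
    # Assumption: ties are broken alphabetically so the result is deterministic.
    return min(score, key=lambda page: (-score[page], page))


# Hypothetical toy link graph: each pair is (source page, target page).
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("d", "c"), ("c", "a")]
counts = link_counts(edges)
center = central_page(edges, relevant={"b", "c", "d"})
print(counts)
print(center)
```

In this toy graph, page "a" links to all three relevant pages and is selected as the central page, while page "c" collects the most in-links. Per-page counts like these are one plausible input for the K-Means algorithm named above.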
K-Means is a simple and easy way to classify a given data set into a certain number of clusters. When the documents are clustered [9] using the K-Means algorithm, each cluster contains more similar documents, which increases the relevancy rate of the search results. When a user submits a query after this clustering process, they receive only the most relevant cluster that matches the request and none of the irrelevant pages. This increases the efficiency of the search results and reduces computational time and search space.

The paper is organized as follows. Section II is an assessment of previous related work on link analysis and clustering in the web domain. In Section III, we describe the existing system. Section IV describes our proposed approach in detail. In Section V, we conclude the paper with some discussion.

International Journal of Computer Science and Information Security (IJCSIS), Vol. 14, No. 3, March 2016 418 https://sites.google.com/site/ijcsis/ ISSN 1947-5500
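As a rough illustration of the K-Means grouping described in this introduction, the sketch below clusters pages into three groups using plain Lloyd's iterations. It is a sketch under stated assumptions, not the authors' implementation: the feature vector (in-link count, out-link count), the toy page data, the hand-picked initial centroids, and the rule that maps clusters to "most relevant", "relevant", and "irrelevant" by centroid size are all assumptions made for the example.

```python
import math


def k_means(points, initial_centroids, max_iter=100):
    """Plain Lloyd's algorithm: assign each point to its nearest centroid,
    recompute each centroid as the mean of its members, and repeat until
    the assignments stop changing."""
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


# Hypothetical pages with (in-link count, out-link count) features.
pages = {
    "p1": (40, 35), "p2": (38, 30),
    "p3": (20, 18), "p4": (22, 15),
    "p5": (2, 1), "p6": (1, 3),
}
points = list(pages.values())

# Assumption: initial centroids are fixed by hand so the run is reproducible.
labels, centroids = k_means(points, [(40, 35), (20, 18), (2, 1)])

# Assumption: clusters with larger link counts are treated as more relevant.
names = ["most relevant", "relevant", "irrelevant"]
order = sorted(range(len(centroids)), key=lambda c: -sum(centroids[c]))
group_of = {c: names[rank] for rank, c in enumerate(order)}
groups = {page: group_of[lab] for page, lab in zip(pages, labels)}
print(groups)
```

A production system would more likely use a library implementation with randomized or k-means++ initialization; fixed starting centroids are used here only to keep the example deterministic.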