International Journal of Computer and Information Technology (ISSN: 2279 – 0764)
Volume 03 – Issue 05, September 2014
www.ijcit.com 929

A New Weighted Keyword Based Similarity Measure for Clustering Webpages

Shihab Rahman, Dolon Chapa, and Shaily Kabir*
Department of Computer Science and Engineering
University of Dhaka
Dhaka, Bangladesh
*Email: shailykabir2000 {at} yahoo.com

Abstract - Relevant information can be retrieved from the web quickly if logically similar webpages are grouped together. Indeed, clustering webpages makes the entire group available to the user, thereby increasing the efficiency of web browsing. Nevertheless, clustering largely depends on the accuracy of the similarity computation among the pages. In this paper, we propose a new weighted keyword based similarity measure for discovering the alikeness among the pages. We represent each page as a vector of extracted keywords, which is then converted into a weighted vector by considering both the frequency and the position of the keywords in the page. To determine the semantic similarity between two pages, we take into account both the syntactic and the semantic relatedness between the respective weighted vectors. Finally, the webpages are grouped using this similarity measure by applying a fuzzy clustering algorithm. Our experimental results, based on different cluster validation indices, show considerable improvement in page clustering compared with the use of other existing similarity measures.

Keywords - web content mining; keyword extraction; similarity measure; fuzzy clustering; cluster validation index

I. INTRODUCTION

The World Wide Web has grown at a phenomenal rate in recent years. Moreover, web content changes every day as new information is added, resulting in an enormous volume of semi-structured and unstructured data.
This huge volume of data is of little use to a user who is overwhelmed by hundreds of websites and webpages while searching and who has difficulty extracting valuable information. In this context, web mining has become relevant for making the web more user friendly. Among the three categories of web mining, content mining works with unstructured and semi-structured data. It aims to mine and extract useful information from webpage content, and the extracted information is useful for grouping similar webpages. Our target is therefore to group semantically related pages together to improve web access.

In this paper, we extend the notion of similarity between webpages by considering their semantic as well as syntactic relatedness. We propose a new semantic similarity measure for computing the alikeness among the pages. We represent each page by a set of keywords, which is later transformed into a vector of weighted keywords by considering the frequency of the keywords along with their position in different tags within the page. We base the similarity computation between two pages on the relatedness between their respective weighted keyword vectors. A set of semantically related page models is generated by applying the fuzzy C-means clustering algorithm [18] to our similarity results. To evaluate the effectiveness of the proposed semantic similarity, we have performed extensive experiments. Our experimental results show that the proposed similarity measure achieves better clustering of semantically related pages, with the lowest cluster duplication, than other existing measures.

The rest of this paper is organized as follows. Section II reviews previous work on similarity measures and their use in grouping pages through clustering. Section III introduces our proposed work with the new semantic similarity measure among the webpages.
Section IV presents results from various experiments. Section V concludes the paper and outlines future work.

II. RELATED RESEARCH ACTIVITIES

Webpages often contain a number of distracting features and unnecessary objects, such as advertisements and irrelevant video and/or audio, which may divert the user's attention from the page content they are actually interested in. Indeed, extensive research has been done to exclude the irrelevant entities efficiently and to identify the keywords of a page successfully. Generally, there are two types of keyword-extraction approaches [1]. One approach is domain-dependent and based on a supervised machine learning model, whereas the other is domain-independent. Among the keyword extraction approaches, TF_IDF (term frequency and inverse document frequency) weighting has been widely used [2]. Frank et al. [3] introduced an automatic keyword extraction algorithm (KEA) based on TF_IDF, which was later enhanced by Kelleher et al. [4] through the introduction of a Semantic Ratio (SR) feature. Typically, successful clustering of webpages depends mainly on the similarity result of the keywords extracted from the pages. Prior research activities