An Empirical Study on Keyword-based Web Site Clustering

Filippo Ricca, Paolo Tonella, Christian Girardi and Emanuele Pianta
ITC-irst, Centro per la Ricerca Scientifica e Tecnologica
38050 Povo (Trento), Italy
{ricca, tonella, cgirardi, pianta}@itc.it

Abstract

Web site evolution is characterized by limited support for the understanding activities that developers must carry out. In fact, design diagrams are often missing or outdated. A potentially interesting option is to reverse engineer high level views of Web sites from the content of the Web pages. Clustering is a valuable technique that can be used in this respect. Web pages can be clustered together based on the similarity of summary information about their content, represented as a list of automatically extracted keywords.

This paper presents an empirical study that was conducted to determine the meaningfulness for Web developers of clusters automatically produced from the analysis of the Web page content. Natural Language Processing (NLP) plays a central role in content analysis and keyword extraction. Thus, a second objective of the study was to assess the contribution of some shallow NLP techniques to the clustering task.

1 Introduction

The decomposition of a whole Web site into smaller cohesive units that are easier to manage during Web site evolution is often undocumented and can only be guessed (sometimes) from the directories. Over time, the navigation structure and the contents of a Web site tend to change, resulting in increased complexity. The initial decomposition, if any existed at all, is often invalidated by the changes. The result is that for existing Web sites little or no support is available for a high level understanding of the overall structure.
Among the tools and techniques developed to support understanding and restructuring of existing Web sites [7, 14, 21], those based on clustering are appealing, because they offer the possibility to untangle the navigation structure, abstracting from the single pages to a view consisting of high-level, cohesive units.

Similarly to software clustering [2, 11], which aims at gathering software components into higher level groupings, thus providing the user with a more abstract, overall view of the system under analysis, Web site clustering produces a high level view of the Web site organization, in terms of clusters of pages and of the relationships among them. Such a view can be exploited in the process of Web site understanding, to gain knowledge about the organization of the entire site. More detailed information can be obtained by exploding each cluster of interest into subclusters, down to the individual pages that are the object of a change.

Given the central role of the content in Web pages, clustering methods developed specifically for Web sites can take advantage of this important source of information. In Web site clustering based on keyword extraction, described in detail elsewhere [20], the content is summarized by a list of keywords that characterize the text in each Web page. Moreover, keywords are weighted so that more specific keywords receive a higher score. Pages are considered similar if they share a large number of keywords. The contribution of each keyword to the similarity measure is proportional to its weight. Similar pages are grouped together by the clustering algorithm.

Shallow Natural Language Processing (NLP) techniques, that is, techniques not based on a deep analysis of meaning, are used to determine the keywords to be associated with the text of each Web page in the site under analysis. Moreover, NLP methods can also give a weight to each keyword, according to its relevance and specificity.
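As an illustration of the keyword-based similarity outlined above, the following Python sketch compares pages by the weight of the keywords they share. It is not the measure defined in [20]: the IDF-style weighting, the normalization by the weight of all keywords of the two pages, and the page names and keyword lists are all assumptions made here for the sake of the example.

```python
from math import log


def keyword_weights(pages):
    """Assumed IDF-style weighting: keywords found in fewer pages score higher."""
    n = len(pages)
    doc_freq = {}
    for keywords in pages.values():
        for kw in set(keywords):
            doc_freq[kw] = doc_freq.get(kw, 0) + 1
    # The +1.0 keeps the weight positive even for keywords present in every page.
    return {kw: log(n / df) + 1.0 for kw, df in doc_freq.items()}


def similarity(kws_a, kws_b, weights):
    """Weight of the shared keywords divided by the weight of all keywords."""
    a, b = set(kws_a), set(kws_b)
    union = a | b
    if not union:
        return 0.0
    shared = sum(weights[kw] for kw in a & b)
    total = sum(weights[kw] for kw in union)
    return shared / total


# Hypothetical pages, each summarized by its extracted keywords.
pages = {
    "courses.html": ["course", "lecture", "exam"],
    "exams.html": ["exam", "course", "date"],
    "staff.html": ["professor", "office", "phone"],
}

weights = keyword_weights(pages)
sim_related = similarity(pages["courses.html"], pages["exams.html"], weights)
sim_unrelated = similarity(pages["courses.html"], pages["staff.html"], weights)
print(f"courses vs exams: {sim_related:.2f}")
print(f"courses vs staff: {sim_unrelated:.2f}")
```

In this sketch, each shared keyword contributes to the score in proportion to its weight, so two pages sharing a few specific keywords can score higher than two pages sharing many generic ones. A clustering algorithm would then group pages whose pairwise similarity is high.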
In this paper, we describe the results of an empirical study that was conducted to investigate the possibility of automating Web site clustering and to assess the role of NLP. More specifically, the empirical study was designed to measure the similarity between the decomposition of a Web site produced by an expert and that obtained through keyword-based clustering. Since the latter depends on the NLP methods adopted for keyword extraction and weighting, alternative NLP methods have been contrasted to determine the best performing ones. Moreover, combinations of

Proceedings of the 12th IEEE International Workshop on Program Comprehension (IWPC’04) 1092-8138/04 $20.00 © 2004 IEEE