An Empirical Study on Keyword-based Web Site Clustering
Filippo Ricca, Paolo Tonella, Christian Girardi and Emanuele Pianta
ITC-irst
Centro per la Ricerca Scientifica e Tecnologica
38050 Povo (Trento), Italy
{ricca, tonella, cgirardi, pianta}@itc.it
Abstract
Web site evolution is characterized by limited support for
the understanding activities that developers must carry out. In
fact, design diagrams are often missing or outdated. A
potentially interesting option is to reverse engineer high level
views of Web sites from the content of the Web pages.
Clustering is a valuable technique that can be used in this
respect. Web pages can be clustered together based on the
similarity of summary information about their content,
represented as a list of automatically extracted keywords.
This paper presents an empirical study that was con-
ducted to determine the meaningfulness for Web develop-
ers of clusters automatically produced from the analysis of
the Web page content. Natural Language Processing (NLP)
plays a central role in content analysis and keyword extrac-
tion. Thus, a second objective of the study was to assess the
contribution of some shallow NLP techniques to the cluster-
ing task.
1 Introduction
The decomposition of a whole Web site into smaller
cohesive units that are easier to manage during Web site
evolution is often undocumented and can only sometimes be
guessed from the directory structure. Over time, the navigation
structure and the contents of a Web site tend to change,
increasing its complexity. The initial decomposition,
if any existed at all, is often invalidated by the changes. The
result is that, for existing Web sites, little or no support is
available for a high level understanding of the overall
structure.
Among the tools and techniques developed to support
understanding and restructuring of existing Web sites [7,
14, 21], those based on clustering are appealing, because
they offer the possibility to untangle the navigation struc-
ture, abstracting from the single pages to a view consisting
of high-level, cohesive units.
Similarly to software clustering [2, 11], which aims at
gathering software components into higher level groupings,
thus providing the user with a more abstract, overall view
of the system under analysis, Web site clustering produces
a high level view of the Web site organization, in terms of
clusters of pages and of the relationships among them. Such
a view can be exploited in the process of Web site under-
standing, to gain knowledge about the organization of the
entire site. More detailed information can be obtained by
exploding each cluster of interest into subclusters, up to the
individual pages that are the object of a change.
Given the central role of the content in Web pages, clus-
tering methods developed specifically for Web sites can take
advantage of this important source of information. In Web
site clustering based on keyword extraction, described in
detail elsewhere [20], the content is summarized by a list
of keywords that characterize the text in each Web page.
Moreover, keywords are weighted so that more specific key-
words receive a higher score. Pages are considered similar
if they share a large number of keywords. The contribution
of each keyword to the similarity measure is proportional
to its weight. Similar pages are grouped together by the
clustering algorithm.
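
As a rough illustration of the similarity idea described above, the following sketch compares pages represented as weighted keyword lists. It is not the measure defined in [20]: cosine similarity is assumed here only because it makes each shared keyword contribute in proportion to its weight, and the function name, page names and weights are all hypothetical.

```python
import math


def keyword_similarity(page_a, page_b):
    """Cosine similarity between two pages given as {keyword: weight} dicts.

    Assumption: cosine similarity stands in for the paper's measure.
    Only shared keywords contribute, each in proportion to its weights.
    """
    shared = set(page_a) & set(page_b)
    dot = sum(page_a[k] * page_b[k] for k in shared)
    norm_a = math.sqrt(sum(w * w for w in page_a.values()))
    norm_b = math.sqrt(sum(w * w for w in page_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a page with no weighted keywords is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical pages with made-up keyword weights.
courses = {"course": 0.9, "lecture": 0.7, "exam": 0.4}
teaching = {"course": 0.8, "exam": 0.5, "student": 0.3}
contacts = {"phone": 0.9, "address": 0.8}

sim_related = keyword_similarity(courses, teaching)
sim_unrelated = keyword_similarity(courses, contacts)
```

With such a measure, a clustering algorithm would tend to place the first two pages in the same cluster and the third in a different one, since the first pair shares highly weighted keywords and the other pair shares none.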
Shallow Natural Language Processing (NLP) techniques,
that is, techniques not based on a deep analysis of
meaning, are used to determine the keywords to be associated
with the text of each Web page in the site under analysis.
Moreover, NLP methods can also give a weight to each
keyword, according to its relevance and specificity.
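
To make the notion of specificity-based weighting concrete, the sketch below extracts keywords with a simple tokenizer and a stopword filter, then weights them with TF-IDF. This is an assumption made for illustration: the text above does not state which weighting scheme is used, and the stopword list, function names and sample pages are hypothetical.

```python
import math
import re
from collections import Counter

# Illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "for"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def weight_keywords(pages):
    """Map {page name: text} to {page name: {keyword: TF-IDF weight}}.

    Assumption: TF-IDF stands in for the weighting scheme. Keywords that
    occur in fewer pages (more specific ones) receive a higher weight.
    """
    tokens = {name: tokenize(text) for name, text in pages.items()}
    doc_freq = Counter()
    for toks in tokens.values():
        doc_freq.update(set(toks))  # count each keyword once per page
    n_pages = len(pages)
    weights = {}
    for name, toks in tokens.items():
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero on empty pages
        weights[name] = {
            k: (c / total) * math.log(n_pages / doc_freq[k])
            for k, c in counts.items()
        }
    return weights


# Hypothetical three-page site.
site = {
    "a": "the course exam",
    "b": "the course lecture",
    "c": "phone course",
}
keyword_weights = weight_keywords(site)
```

In this toy site the keyword "course" appears on every page and therefore gets weight zero, while "exam", which appears on a single page, gets a positive weight.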
In this paper, we describe the results of an empirical
study that was conducted to investigate the possibility of
automating Web site clustering and to assess the role of
NLP. More specifically, the empirical study was designed
to measure the similarity between the decomposition of a
Web site produced by an expert and that obtained through
keyword-based clustering. Since the latter depends on the
NLP methods adopted for keyword extraction and weight-
ing, alternative NLP methods have been contrasted to deter-
mine the best performing ones. Moreover, combinations of
Proceedings of the 12th IEEE International Workshop on Program Comprehension (IWPC’04)
1092-8138/04 $ 20.00 © 2004 IEEE