Journal of Intelligent Information Systems, 23:1, 5–16, 2004
© 2004 Kluwer Academic Publishers. Printed in The United States.

Interval Set Clustering of Web Users with Rough K-Means

PAWAN LINGRAS Pawan.Lingras@StMarys.CA (WWW:http://www.stmarys.ca)
CHAD WEST
Saint Mary’s University, Halifax, Nova Scotia, B3H 3C3, Canada

Received May 29, 2002; Revised December 11, 2002; Accepted April 1, 2003

Abstract. Data collection and analysis in web mining face certain unique challenges. For a variety of reasons inherent in web browsing and web logging, the likelihood of bad or incomplete data is higher than in conventional applications. The analytical techniques in web mining need to accommodate such data. Fuzzy and rough sets provide the ability to deal with incomplete and approximate information. Fuzzy set theory has been shown to be useful in three important aspects of web and data mining, namely clustering, association, and sequential analysis. There is increasing interest in research on clustering based on rough set theory. Clustering is an important part of web mining that involves finding natural groupings of web resources or web users. Researchers have pointed out some important differences between clustering in conventional applications and clustering in web mining. For example, the clusters and associations in web mining do not necessarily have crisp boundaries. As a result, researchers have studied the possibility of using fuzzy sets in web mining clustering applications. Recent attempts have used genetic algorithms based on rough set theory for clustering. However, clustering based on genetic algorithms may not be able to handle the large amount of data typical in a web mining application. This paper proposes a variation of the K-means clustering algorithm based on properties of rough sets. The proposed algorithm represents clusters as interval or rough sets.
The paper also describes the design of an experiment, including data collection and the clustering process. The experiment is used to create interval set representations of clusters of web visitors.

Keywords: clustering, interval sets, K-means algorithm, rough sets, unsupervised learning, web mining

Author to whom all correspondence should be addressed.

1. Introduction

Web mining can be viewed as the extraction of structure from an unlabeled, semi-structured data set containing the characteristics of users and information (Joshi and Krishnapuram, 1998). Logs of web access available on most servers are good examples of the data sets used in web mining. Three important operations in web mining are clustering, association, and sequential analysis. This paper focuses on clustering, which is a process of identifying natural groupings of objects. The clustering process is an important step in establishing user profiles. User profiling on the web consists of studying important characteristics of the web visitors. Due to the ease of movement from one portal to another, web users can be very mobile. If a particular web site doesn’t satisfy the needs of the user in a relatively short period of time, the user will