Concept Discovery in Collaborative Recommender Systems

Patrick Clerkin 1 and Pádraig Cunningham 2 and Conor Hayes 3

Abstract. There are two main types of recommender systems for e-commerce applications: content-based systems and automated collaborative filtering systems. We are interested in combining the best features of both approaches. In this paper, we investigate the possibility of using the k-means clustering algorithm as a basis for automatically generating content descriptions from the user transaction data that drives the collaborative filtering process. Using the partitions of the asset space discovered by k-means, we develop a novel recommendation strategy for recommender systems. We present some encouraging results for two real-world recommender systems. We conclude by outlining our approach to automatically generating descriptions of the clusters and report on an experiment designed to test concepts generated for the SmartRadio recommender system.

1 INTRODUCTION

A key role for intelligent systems in e-commerce is product recommendation [2]. Large e-commerce sites can have millions of products and customers. Since it is necessary to match products to customers automatically, recommender systems based on statistical, machine learning and knowledge discovery techniques have been developed to meet this need.

Broadly, there are two major approaches to the recommendation task, namely content-based recommendation and automated collaborative filtering. The objective in this paper is to explore the mechanisms for taking the raw data on which collaborative recommendation is based and automatically eliciting the more semantically rich cases that can be used for content-based recommendation.

One problem with the collaborative approach is the bootstrap problem: there is no basis for making recommendations to new users who have not previously rated any assets (movies, songs, etc.).
In this paper, we propose that the data that underpins the collaborative recommendation process can be mined to discover appropriate representations to underpin content-based recommendation. We show how cluster analysis can be used to generate high-level representations that can produce good quality recommendations. We also suggest that these representations are useful in overcoming the bootstrap problem.

1 Machine Learning Group, Department of Computer Science, University of Dublin, Trinity College, Dublin, Ireland, email: Patrick.Clerkin@cs.tcd.ie
2 Same affiliation, email: Padraig.Cunningham@cs.tcd.ie
3 Same affiliation, email: Conor.Hayes@cs.tcd.ie

2 RECOMMENDER SYSTEMS

As stated in the introduction, there are two approaches to recommendation on the Web. The recommendation process can be content-based, as represented by the upper path in Figure 1, where an appropriate representation of the assets and the users' requirements is determined at design time and recommendation is based on this representation. In the Case-Based Reasoning community this is referred to as case-based recommendation. The alternative lower path in the figure is automated collaborative filtering (ACF), which works with raw data on users' ratings and behaviour and uses this data to produce recommendations. The focus of this paper is on how knowledge discovery techniques can be applied to this raw data to establish the appropriate representations for content-based recommendation. First, we will present brief descriptions of content-based and collaborative recommendation.

Figure 1. An overview of content-based and collaborative recommendation and the role for knowledge discovery in exploiting the benefits of both approaches.

2.1 Content-based recommendation

Here we will describe a CBR-like content-based recommendation system that we can use for comparison purposes. Table 1 shows a case-like description of a film (movie) and Table 2 shows the corresponding description of a user of the recommendation system.
In this scenario, recommendation is based on how well a film matches a user's profile. In producing recommendations for a user, the matching score for each film in turn would be determined, and the highest-scoring films not already viewed would be recommended.
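
The scoring-and-ranking procedure just described can be illustrated with a minimal Python sketch. It is not the system's actual implementation: the feature names (genre, era), the film identifiers, and the matching score itself (the fraction of profile features whose preferred values include the film's value) are all assumptions made for illustration, and the paper's Tables 1 and 2 may use different features and a different scoring function.

```python
from typing import Dict, List, Set

# Hypothetical case-like film descriptions: feature name -> value.
Film = Dict[str, str]
# Hypothetical user profile: feature name -> set of preferred values.
Profile = Dict[str, Set[str]]


def match_score(film: Film, profile: Profile) -> float:
    """Assumed score: fraction of profile features the film satisfies."""
    if not profile:
        return 0.0
    hits = sum(
        1 for feature, preferred in profile.items()
        if film.get(feature) in preferred
    )
    return hits / len(profile)


def recommend(films: Dict[str, Film], profile: Profile,
              viewed: Set[str], n: int = 3) -> List[str]:
    """Score every unviewed film and return the n highest-scoring titles."""
    scored = [
        (match_score(description, profile), title)
        for title, description in films.items()
        if title not in viewed
    ]
    # Highest score first; ties are broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored[:n]]


# Illustrative data only; none of it comes from the paper.
films = {
    "film_a": {"genre": "comedy", "era": "1990s"},
    "film_b": {"genre": "drama", "era": "1990s"},
    "film_c": {"genre": "comedy", "era": "1980s"},
    "film_d": {"genre": "thriller", "era": "1980s"},
}
profile = {"genre": {"comedy"}, "era": {"1990s", "1980s"}}
viewed = {"film_a"}

print(recommend(films, profile, viewed, n=2))
```

In this sketch film_a matches the profile perfectly but is excluded because the user has already viewed it, so the best remaining match, film_c, is ranked first.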