Collaborative Filtering via Concept Decomposition on the Netflix Dataset

Nicholas Ampazis 1

Abstract. Collaborative filtering recommender systems make automatic predictions about the interests of a user by collecting information from many users (collaborating). Most recommendation algorithms are based on finding sets of customers or items whose ratings overlap in order to create a model for inferring future ratings or items that might be of interest to a particular user. Traditional collaborative filtering techniques such as k-Nearest Neighbours and Singular Value Decomposition (SVD) usually provide good accuracy but are computationally very expensive. The Netflix Prize is a collaborative filtering problem whose dataset is much larger than the previously known benchmark sets, and thus traditional methods are stressed to their limits when challenged with a dataset of that size. In this paper we present experimental results that show how the concept decomposition method performs on the movie rating prediction task over the Netflix dataset, and we show that it achieves a well-balanced trade-off between computational complexity and prediction accuracy.

1 INTRODUCTION

Collaborative filtering (CF) is a subfield of machine learning that aims at creating algorithms to predict user preferences based on past user behavior in purchasing or rating items [15],[18]. CF recommender systems are very important in e-commerce applications: by helping people find items that they would like to purchase more easily, they enhance the user experience and, consequently, generate sales and increase revenue [19].

In October 2006, Netflix released a large movie rating dataset and challenged the data mining, machine learning and computer science communities to develop systems that could beat the accuracy of their in-house developed recommendation system (Cinematch) by 10% [3].
In order to render the challenge more interesting, the company will award a Grand Prize of $1M to the first team that attains this goal and, in addition, Progress Prizes of $50K will be awarded on the anniversaries of the Prize to teams that make sufficient accuracy improvements. Apart from the financial incentive, however, the Netflix Prize contest is enormously useful for recommender system research, since the released Netflix dataset is by far the largest ratings dataset ever made available to the research community. Most work on recommender systems outside of companies like Amazon or Netflix has so far had to make do with the relatively small 1M ratings MovieLens data [12] or the 3M ratings EachMovie dataset [11]. Netflix provided 100,480,507 ratings (on a scale from 1 to 5 integral stars) along with their dates from 480,189 randomly chosen, anonymous subscribers on 17,770 movie titles. The data were collected between October 1998 and December 2005 and reflect the distribution of all ratings received by Netflix during this period.

1 Department of Financial and Management Engineering, University of the Aegean, Greece, email: n.ampazis@fme.aegean.gr

Netflix withheld over 3M of the most recent ratings from those same subscribers over the same set of movies as a competition qualifying set, and contestants are required to make predictions for all 3M withheld ratings in the qualifying set. As a performance measure the company has selected the Root Mean Square Error (RMSE) between the actual and predicted scores. In addition, Netflix identified a "probe" subset of the complete training set, consisting of about 1.4M ratings, and published the Cinematch RMSE value on that subset to permit off-line comparison with systems before submission on the qualifying set.
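As a concrete illustration of the RMSE criterion and of the 10% improvement target mentioned above, the following is a minimal sketch in Python. The function names `rmse` and `improvement_percent` and all numeric values in the usage lines are hypothetical and chosen only for illustration; they are not taken from the Netflix data or from Cinematch.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean square error between actual and predicted ratings."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one rating is required")
    # Sum of squared differences over all (actual, predicted) pairs.
    squared_error = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared_error / len(actual))


def improvement_percent(baseline_rmse: float, candidate_rmse: float) -> float:
    """Relative RMSE reduction of a candidate over a baseline, in percent."""
    if baseline_rmse <= 0:
        raise ValueError("baseline_rmse must be positive")
    return 100.0 * (baseline_rmse - candidate_rmse) / baseline_rmse


if __name__ == "__main__":
    # Hypothetical ratings on the 1-5 star scale (illustrative values only).
    actual_ratings = [5, 1, 4, 3]
    predicted_ratings = [4.2, 2.1, 3.8, 3.0]
    candidate = rmse(actual_ratings, predicted_ratings)
    # Hypothetical baseline; NOT the published Cinematch value.
    print(candidate, improvement_percent(1.0, candidate))
```

Because RMSE squares each difference before averaging, a few large misses weigh more heavily than many small ones, which is worth keeping in mind when comparing systems on the probe subset.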
In this paper, we present the main components of one of our approaches to the Netflix Prize based on the concept decomposition method [5], and we show that it combines moderate computational complexity with good prediction accuracy on the RMSE criterion. However, owing to the limits of the paper and our obvious interests, we intentionally do not publish all details of our method; some small but important details remain undisclosed. For this reason we report only results evaluated on the probe subset of the Netflix dataset.

2 COLLABORATIVE FILTERING RECOMMENDER SYSTEMS

The goal of a CF algorithm is to recommend products to a target user based on the opinions of other users [6],[8],[14]. In a typical CF scenario, there is a list of $n$ users $U = \{u_1, u_2, \ldots, u_n\}$ and a list of $m$ items $I = \{i_1, i_2, \ldots, i_m\}$. For each user $u_i$ we have a list of items $I_{u_i}$ about which the user has expressed an opinion. These opinions can be either explicitly given by the user as a rating score (as is the case with Netflix) or implicitly derived from the user's purchase records. Under this setting we consider a distinguished user $u_a \in U$, called the active user, for whom the task of a collaborative filtering algorithm is to suggest other items that the active user might like. This suggestion can take either of the following two forms:

Prediction: Provide a numerical value $P_{a,j}$ expressing the predicted likeliness of item $i_j \notin I_{u_a}$ for the active user $u_a$. The predicted value should be within the same scale (e.g., from 1 to 5) as the opinion values provided by $u_a$ in the past.

Recommendation: Provide a list of $N$ items, $I_r \subset I$, that the active user will like the most. The recommended list should contain only items not already in $I_{u_a}$, that is, $I_r \cap I_{u_a} = \emptyset$. This kind of suggestion is also known as Top-N recommendation.
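The two forms of suggestion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the raw scores are presumed to come from some already trained CF model (not specified here), and the names `clip_to_scale`, `top_n`, `raw_scores` and `rated_by_active_user` are hypothetical. Clipping to the rating scale is one simple way to satisfy the requirement that a prediction lie on the same scale as past ratings.

```python
from typing import Dict, List, Set


def clip_to_scale(score: float, low: float = 1.0, high: float = 5.0) -> float:
    """Prediction form: force a raw model score onto the rating scale."""
    return max(low, min(high, score))


def top_n(raw_scores: Dict[str, float], rated: Set[str], n: int) -> List[str]:
    """Recommendation form: the n best-scored items the user has not rated.

    raw_scores maps every item id to a raw score from some CF model
    (assumed to exist); rated is the set of items already rated by the
    active user, so the result never intersects it.
    """
    candidates = [item for item in raw_scores if item not in rated]
    # Highest clipped score first; ties broken by item id for determinism.
    candidates.sort(key=lambda item: (-clip_to_scale(raw_scores[item]), item))
    return candidates[:n]


if __name__ == "__main__":
    # Hypothetical raw scores for four items (illustrative values only).
    raw_scores = {"i1": 4.7, "i2": 5.6, "i3": 0.4, "i4": 3.9}
    rated_by_active_user = {"i1"}
    print(clip_to_scale(raw_scores["i2"]))
    print(top_n(raw_scores, rated_by_active_user, n=2))
```

Filtering the rated items out before ranking, rather than after, guarantees that the list still contains up to N items even when the user's highest-scored items are ones already rated.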
Most collaborative-filtering-based recommender systems represent every user as an m-dimensional vector of items, where m is the number of distinct catalog items, and every item as an n-dimensional