A Nonparametric Hierarchical Bayesian Framework for Information Filtering

Kai Yu, Volker Tresp, Shipeng Yu
Corporate Technology, Siemens AG, Munich, Germany
Institute for Computer Science, University of Munich, Germany
kai.yu@siemens.com, volker.tresp@siemens.com, spyu@dbs.informatik.uni-muenchen.de

ABSTRACT
Information filtering has made considerable progress in recent years. The predominant approaches are content-based methods and collaborative methods. Researchers have largely concentrated on either of the two approaches since a principled unifying framework is still lacking. This paper suggests that both approaches can be combined under a hierarchical Bayesian framework. Individual content-based user profiles are generated and collaboration between various user models is achieved via a common learned prior distribution. However, it turns out that a parametric distribution (e.g. Gaussian) is too restrictive to describe such a common learned prior distribution. We thus introduce a nonparametric common prior, which is a sample generated from a Dirichlet process which assumes the role of a hyper prior. We describe effective means to learn this nonparametric distribution, and apply it to learn users’ information needs. The resultant algorithm is simple and understandable, and offers a principled solution to combine content-based filtering and collaborative filtering. Within our framework, we are now able to interpret various existing techniques from a unifying point of view. Finally we demonstrate the empirical success of the proposed information filtering methods.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: Information Search and Retrieval - information filtering, retrieval models

General Terms
Algorithms, Theory, Human Factors

Keywords
Collaborative Filtering, Content-Based Filtering, Dirichlet Process, Nonparametric Bayesian Modelling

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGIR’04, July 25–29, 2004, Sheffield, South Yorkshire, UK.
Copyright 2004 ACM 1-58113-881-4/04/0007 ...$5.00.

1. INTRODUCTION
Information filtering denotes a family of techniques that help users find the right information items while filtering out undesired ones. In a wide range of applications, such as spam email filtering, news filtering, and recommender systems for products (e.g. books), information filtering is playing an increasingly important role. Content-based filtering (CBF) and collaborative filtering (CF) represent the two major information filtering technologies. CBF [19, 2, 10, 13] has its roots in the concept of relevance feedback in the information retrieval literature (e.g. Rocchio’s algorithm [19]). CBF exploits the similarity of content between information items (e.g. articles, images, music) to infer which of the yet unseen items might be of interest to the active user, based on annotated examples previously given by that user. In contrast, collaborative filtering methods [18, 20, 4] typically accumulate a database of item ratings, given explicitly or implicitly by a large set of users.
The prediction of ratings for the active user is based solely on the ratings provided by all other users, under the assumption that like-minded users share similar information needs. The method does not rely on a description of item content.

One major difficulty in designing CBF systems lies in extracting content features that are sufficiently indicative. There is often a large gap between low-level content features (visual, auditory, or others) and high-level user interests (liking or disliking a painting or a CD). In some other circumstances, the features are not available at all. Fortunately, the information on personal preferences and interests is carried in (explicit or implicit) user ratings. Thus CF systems can make use of these high-level features rather easily, by combining the ratings of other like-minded users.

Pure CF relies solely on user preferences, without incorporating the actual content of items. CF often suffers from the extreme sparsity of available data, in the sense that users typically rate only very few items, which makes it difficult to compare the interests of two users. Furthermore, pure CF cannot handle items for which no user has previously given a rating. Such cases are easily handled in CBF systems, which can make predictions based on the content of the new item.

Naturally, previous researchers have worked on compensating for the drawbacks of each particular approach. Many approaches have focused on hybrid filtering to unify both approaches [12, 7, 2, 3, 16]. However, due to the lack of a unifying framework for information filtering, existing solutions were developed mostly in heuristic or ad-hoc ways. The chal-