Does Wikipedia Information Help Netflix Predictions?

John Lees-Miller, Fraser Anderson, Bret Hoehn, Russell Greiner
University of Alberta, Department of Computing Science
{leesmill, frasera, hoehn, greiner}@cs.ualberta.ca

Abstract

We explore several ways to estimate movie similarity from the free encyclopedia Wikipedia with the goal of improving our predictions for the Netflix Prize. Our system first uses the content and hyperlink structure of Wikipedia articles to identify similarities between movies. We then predict a user’s unknown ratings by using these similarities in conjunction with the user’s known ratings to initialize matrix factorization and k-Nearest Neighbours algorithms. We blend these results with existing ratings-based predictors. Finally, we discuss our empirical results, which suggest that external Wikipedia data does not significantly improve the overall prediction accuracy.

1 Introduction

Netflix distributes movies via an internet site. Their service includes a recommender system that suggests movies to a user based on that user’s past movie ratings. The Netflix Prize is a competition to inspire researchers to find ways to produce more accurate recommendations. In particular, the challenge is to predict how a user will rate a particular movie, seen on a specified date. To help, Netflix provides approximately 100 million ratings for 17 770 movies by 480 thousand users as training data for a collaborative filtering method. Of these ratings provided by Netflix, 1.4 million are designated as the probe set, which we use for testing; see [4].

Previous approaches have achieved considerable success using only this ratings data [3]. We begin with the hypothesis that external data from Wikipedia can be used to improve prediction accuracy. The use of such data has been shown to be beneficial in many collaborative filtering tasks, both in recommendation system and movie domains [7, 12, 16].
Balabanovic and Shoham [1] successfully use content extracted from an external source, the Internet Movie Database (IMDb), to complement collaborative filtering methods. Netflix allows contestants to use additional data, but it must be free for commercial use. Unfortunately, this eliminates IMDb, even though this is known to be useful for Netflix predictions [14]. Fortunately, this does not prevent us from using movie information from the free encyclopedia Wikipedia¹.

Data sources such as IMDb and Yahoo! Movies are highly structured, making it easy to find salient movie features. As Wikipedia articles are much less structured, it is more difficult to extract useful information from them. Our approach is as follows. First, we identify the Wikipedia articles corresponding to the Netflix Prize movies (Section 2). We then estimate movie similarity by computing article similarity based on both article content (term and document frequency of words in the article text; Section 3) and hyperlink structure (especially shared links; Section 4). We use this information to make predictions with k-Nearest Neighbors (k-NN) and stochastic gradient descent matrix factorization [3, 6, 11] (also known as Pseudo-SVD) methods. We then blend our predictions with others from the University of Alberta’s Reel Ingenuity team (Section 5).

2 Page Matching

We use a Wikipedia snapshot² containing roughly 6.2 million articles, most of which are not related to movies. In order to use the Wikipedia data we must map each Netflix title to an appropriate Wikipedia article, if one exists. We use several methods to find such matches.

One method finds articles using longest common subsequence and keyword weighting in the article titles. We put more weight on words like “movie” and “film,” and less weight on words like “Season” and “Volume,” which tend to be present in the Netflix titles but not in Wikipedia article titles.
This method matches 14 992 Netflix titles with Wikipedia articles, of which approximately 77% are appropriate; here and below, we estimate appropriateness with spot checks conducted by the authors.

¹ http://en.wikipedia.org/wiki/Wikipedia:Copyrights
² http://download.wikimedia.org (February 25, 2008)

2008 Seventh International Conference on Machine Learning and Applications. 978-0-7695-3495-4/08 $25.00 © 2008 IEEE. DOI 10.1109/ICMLA.2008.121
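The title-matching method of Section 2 (longest common subsequence combined with keyword weighting) could be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the paper does not say whether the subsequence is computed over words or characters, nor does it give the weight values or the way the two signals are combined. Here we assume a word-level subsequence, and the names `LOW_WEIGHT`, `BONUS`, `match_score`, and `best_match`, together with all numeric weights, are hypothetical.

```python
import re

# Hypothetical weights; the paper names these words but gives no values.
LOW_WEIGHT = {"season": 0.25, "volume": 0.25}  # common in Netflix titles only
BONUS = {"film": 0.5, "movie": 0.5}            # hints that an article is about a movie


def tokenize(title):
    """Lowercase a title and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", title.lower())


def token_weight(token):
    """Weight of one token; down-weighted words cost little when unmatched."""
    return LOW_WEIGHT.get(token, 1.0)


def weighted_lcs(a, b):
    """Total weight of the heaviest common subsequence of two token lists."""
    # dp[i][j] holds the best weight achievable using a[:i] and b[:j].
    dp = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ta in enumerate(a, 1):
        for j, tb in enumerate(b, 1):
            if ta == tb:
                dp[i][j] = dp[i - 1][j - 1] + token_weight(ta)
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def match_score(netflix_title, wiki_title):
    """Normalized weighted-LCS overlap plus a bonus for movie keywords."""
    n = tokenize(netflix_title)
    w = tokenize(wiki_title)
    core = [t for t in w if t not in BONUS]  # bonus words are scored separately
    total = max(sum(map(token_weight, n)), sum(map(token_weight, core)))
    base = weighted_lcs(n, core) / total if total else 0.0
    bonus = sum(BONUS[t] for t in set(w) if t in BONUS)
    return base + bonus


def best_match(netflix_title, wiki_titles):
    """Return the candidate Wikipedia title with the highest score."""
    return max(wiki_titles, key=lambda t: match_score(netflix_title, t))
```

For instance, `best_match("Titanic", ["RMS Titanic", "Titanic (1997 film)", "Titanic (musical)"])` selects the film article under these assumed weights, because the word "film" earns a bonus that the other candidates lack. In practice a minimum-score threshold would presumably be needed so that Netflix titles with no corresponding article are left unmatched.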