Does Wikipedia Information Help Netflix Predictions?

John Lees-Miller, Fraser Anderson, Bret Hoehn, Russell Greiner
University of Alberta
Department of Computing Science
{leesmill, frasera, hoehn, greiner}@cs.ualberta.ca

Abstract

We explore several ways to estimate movie similarity from the free encyclopedia Wikipedia with the goal of improving our predictions for the Netflix Prize. Our system first uses the content and hyperlink structure of Wikipedia articles to identify similarities between movies. We then predict a user’s unknown ratings by using these similarities in conjunction with the user’s known ratings to initialize matrix factorization and k-Nearest Neighbours algorithms. We blend these results with existing ratings-based predictors. Finally, we discuss our empirical results, which suggest that external Wikipedia data does not significantly improve the overall prediction accuracy.

1 Introduction

Netflix distributes movies via an internet site. Their service includes a recommender system that suggests movies to a user based on that user’s past movie ratings. The Netflix Prize is a competition to inspire researchers to find ways to produce more accurate recommendations. In particular, the challenge is to predict how a user will rate a particular movie, seen on a specified date. To help, Netflix provides approximately 100 million ratings for 17 770 movies by 480 thousand users as training data for a collaborative filtering method. Of these ratings provided by Netflix, 1.4 million are designated as the probe set, which we use for testing; see [4].

Previous approaches have achieved considerable success using only this ratings data [3]. We begin with the hypothesis that external data from Wikipedia can be used to improve prediction accuracy. The use of such data has been shown to be beneficial in many collaborative filtering tasks, both in recommendation system and movie domains [7, 12, 16].
Balabanovic and Shoham [1] successfully use content extracted from an external source, the Internet Movie Database (IMDb), to complement collaborative filtering methods. Netflix allows contestants to use additional data, but it must be free for commercial use. Unfortunately, this eliminates IMDb, even though this is known to be useful for Netflix predictions [14]. Fortunately, this does not prevent us from using movie information from the free encyclopedia Wikipedia¹.

Data sources such as IMDb and Yahoo! Movies are highly structured, making it easy to find salient movie features. As Wikipedia articles are much less structured, it is more difficult to extract useful information from them.

Our approach is as follows. First, we identify the Wikipedia articles corresponding to the Netflix Prize movies (Section 2). We then estimate movie similarity by computing article similarity based on both article content (term and document frequency of words in the article text; Section 3) and hyperlink structure (especially shared links; Section 4). We use this information to make predictions with k-Nearest Neighbors (k-NN) and stochastic gradient descent matrix factorization [3, 6, 11] (also known as Pseudo-SVD) methods. We then blend our predictions with others from the University of Alberta’s Reel Ingenuity team (Section 5).

2 Page Matching

We use a Wikipedia snapshot² containing roughly 6.2 million articles, most of which are not related to movies. In order to use the Wikipedia data we must map each Netflix title to an appropriate Wikipedia article, if one exists. We use several methods to find such matches.

One method finds articles using longest common subsequence and keyword weighting in the article titles. We put more weight on words like “movie” and “film,” and less weight on words like “Season” and “Volume,” which tend to be present in the Netflix titles but not in Wikipedia article titles.
This method matches 14 992 Netflix titles with Wikipedia articles, of which approximately 77% are appropriate; here and below, we estimate appropriateness with spot checks conducted by the authors.

¹ http://en.wikipedia.org/wiki/Wikipedia:Copyrights
² http://download.wikimedia.org (February 25, 2008)

2008 Seventh International Conference on Machine Learning and Applications
978-0-7695-3495-4/08 $25.00 © 2008 IEEE
DOI 10.1109/ICMLA.2008.121
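
To make the first matching method of Section 2 concrete, the following is a minimal sketch of title matching by a weighted longest common subsequence over word tokens. It is not the authors' implementation: the paper names the keywords but gives neither the weights nor the scoring formula, so the values in DAMPED and FILM_BONUS, the Dice-style normalization, and the function names (tokenize, weighted_lcs, match_score, best_match) are all assumptions made for illustration.

```python
import re

# Hypothetical values: the paper names these keywords but not their weights.
DAMPED = {"season": 0.25, "volume": 0.25}  # frequent in Netflix titles only
FILM_WORDS = {"film", "movie"}             # mark a Wikipedia title as a film
FILM_BONUS = 0.5                           # assumed bonus for such titles


def tokenize(title):
    """Lowercase a title and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", title.lower())


def token_weight(token):
    """Weight of one token; down-weighted keywords count for less."""
    return DAMPED.get(token, 1.0)


def weighted_lcs(a, b):
    """Total weight of the heaviest common subsequence of token lists a, b."""
    # dp[i][j] is the best weight achievable using a[:i] and b[:j].
    dp = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            best = max(dp[i - 1][j], dp[i][j - 1])
            if x == y:
                best = max(best, dp[i - 1][j - 1] + token_weight(x))
            dp[i][j] = best
    return dp[-1][-1]


def match_score(netflix_title, wiki_title):
    """Dice-style overlap of the two titles plus an assumed film bonus."""
    n_tokens = tokenize(netflix_title)
    w_all = tokenize(wiki_title)
    # Film-marking words earn a bonus instead of counting as unmatched text.
    w_tokens = [t for t in w_all if t not in FILM_WORDS]
    bonus = FILM_BONUS if len(w_tokens) < len(w_all) else 0.0
    total = sum(map(token_weight, n_tokens)) + sum(map(token_weight, w_tokens))
    if total == 0:
        return 0.0
    return 2.0 * weighted_lcs(n_tokens, w_tokens) / total + bonus


def best_match(netflix_title, candidates):
    """Return the candidate Wikipedia title with the highest score."""
    return max(candidates, key=lambda c: match_score(netflix_title, c))


if __name__ == "__main__":
    candidates = ["Titanic", "RMS Titanic", "Titanic (1997 film)"]
    print(best_match("Titanic", candidates))
    print(match_score("Friends: Season 1", "Friends"))
```

Under these assumed weights, a film-marking word in the article title raises the score of a candidate, while a word such as “Season” in the Netflix title costs little when it is missing from the article title. A real system would also need a score threshold below which a title is treated as unmatched.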