Does Wikipedia Information Help Netflix Predictions?
John Lees-Miller, Fraser Anderson, Bret Hoehn, Russell Greiner
University of Alberta
Department of Computing Science
{leesmill, frasera, hoehn, greiner}@cs.ualberta.ca
Abstract
We explore several ways to estimate movie similarity
from the free encyclopedia Wikipedia with the goal of im-
proving our predictions for the Netflix Prize. Our system
first uses the content and hyperlink structure of Wikipedia
articles to identify similarities between movies. We then
predict a user’s unknown ratings by using these similarities
in conjunction with the user’s known ratings to initialize
matrix factorization and k-Nearest Neighbours algorithms.
We blend these results with existing ratings-based predic-
tors. Finally, we discuss our empirical results, which sug-
gest that external Wikipedia data does not significantly im-
prove the overall prediction accuracy.
1 Introduction
Netflix distributes movies via an internet site. Their ser-
vice includes a recommender system that suggests movies
to a user based on that user’s past movie ratings. The Net-
flix Prize is a competition to inspire researchers to find ways
to produce more accurate recommendations. In particular,
the challenge is to predict how a user will rate a particular
movie, seen on a specified date. To help, Netflix provides
approximately 100 million ratings for 17 770 movies by 480
thousand users as training data for a collaborative filtering
method. Of these ratings provided by Netflix, 1.4 million
are designated as the probe set, which we use for testing;
see [4].
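The role of the probe set as held-out test data can be sketched as follows. This is a minimal illustration, assuming ratings are stored as (user, movie, rating) triples; it scores predictions by root mean squared error (RMSE), the measure the Netflix Prize uses to rank submissions. The function names and data layout are hypothetical and do not describe the system's actual code.

```python
import math

def split_probe(ratings, probe_keys):
    """Separate (user, movie, rating) triples into training and probe lists.

    probe_keys is a set of (user, movie) pairs designated as the probe set
    (an assumed representation for this sketch).
    """
    train, probe = [], []
    for user, movie, rating in ratings:
        target = probe if (user, movie) in probe_keys else train
        target.append((user, movie, rating))
    return train, probe

def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating sequences."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

A predictor is fitted on the training list only and then evaluated by passing its predictions for the probe pairs, together with the true probe ratings, to `rmse`.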
Previous approaches have achieved considerable success
using only this ratings data [3]. We begin with the hy-
pothesis that external data from Wikipedia can be used to
improve prediction accuracy. The use of such data has
been shown to be beneficial in many collaborative filtering
tasks, in both recommender system and movie
domains [7, 12, 16]. Balabanovic and Shoham [1] successfully
use content extracted from an external source, the Internet
Movie Database (IMDb), to complement collaborative fil-
tering methods. Netflix allows contestants to use additional
data, but it must be free for commercial use. Unfortunately,
this requirement eliminates IMDb, even though IMDb data is known
to be useful for Netflix predictions [14]. Fortunately, it does not
prevent us from using movie information from the free
encyclopedia Wikipedia¹.
Data sources such as IMDb and Yahoo! Movies are
highly structured, making it easy to find salient movie fea-
tures. As Wikipedia articles are much less structured, it is
more difficult to extract useful information from them. Our
approach is as follows. First, we identify the Wikipedia ar-
ticles corresponding to the Netflix Prize movies (Section 2).
We then estimate movie similarity by computing article
similarity based on both article content (term and document
frequency of words in the article text; Section 3) and hyper-
link structure (especially shared links; Section 4). We use
this information to make predictions with k-Nearest Neighbours
(k-NN) and stochastic gradient descent matrix factorization
[3, 6, 11] (also known as Pseudo-SVD) methods. We
then blend our predictions with others from the University
of Alberta’s Reel Ingenuity team (Section 5).
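The content-based part of this pipeline can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's implementation: it builds TF-IDF vectors from article text with whitespace tokenization, compares articles by cosine similarity, and predicts a rating as the similarity-weighted mean of the user's ratings for the k most similar movies. The function names, the tokenization, and the exact weighting formula are all hypothetical choices made for this example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each document id to a sparse TF-IDF vector (dict: term -> weight)."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    n_docs = len(docs)
    vectors = {}
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        vectors[doc_id] = {t: c * math.log(n_docs / df[t]) for t, c in tf.items()}
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def knn_predict(target, user_ratings, sim, k=2):
    """Similarity-weighted mean of a user's ratings over the k nearest movies.

    user_ratings maps movie id -> known rating; sim(a, b) returns a similarity.
    Returns None when no rated movie has positive similarity to the target.
    """
    scored = [(sim(target, m), r) for m, r in user_ratings.items() if m != target]
    nearest = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    nearest = [(s, r) for s, r in nearest if s > 0]
    total = sum(s for s, _ in nearest)
    if total == 0:
        return None
    return sum(s * r for s, r in nearest) / total
```

Here `sim` could be, for example, `lambda a, b: cosine(vectors[a], vectors[b])` for vectors built from the matched Wikipedia articles. A similarity derived from shared hyperlinks could be passed in the same way.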
2 Page Matching
We use a Wikipedia snapshot² containing roughly 6.2
million articles, most of which are not related to movies.
In order to use the Wikipedia data we must map each Net-
flix title to an appropriate Wikipedia article, if one exists.
We use several methods to find such matches.
One method finds articles using longest common subse-
quence and keyword weighting in the article titles. We put
more weight on words like “movie” and “film,” and less
weight on words like “Season” and “Volume,” which tend
to be present in the Netflix titles but not in Wikipedia arti-
cle titles. This method matches 14 992 Netflix titles with
Wikipedia articles, of which approximately 77% are appro-
priate; here and below, we estimate appropriateness with
spot checks conducted by the authors.
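The title-matching idea above can be sketched as follows. This is an illustrative sketch only: the text does not give the scoring formula or the weight values, so the word-level longest common subsequence, the multiplicative bonus for "film" and "movie," and the numeric weights below are all assumptions made for this example.

```python
import re

# Hypothetical weights; the text names these words but not their values.
ARTICLE_BONUS = {"film": 0.5, "movie": 0.5}   # favoured in Wikipedia titles
LOW_WEIGHT = {"season": 0.2, "volume": 0.2}   # common in Netflix titles only

def tokens(title):
    """Lower-case alphanumeric word tokens of a title."""
    return re.findall(r"[a-z0-9]+", title.lower())

def weight(token):
    """Weight of a token; down-weighted words count for less."""
    return LOW_WEIGHT.get(token, 1.0)

def weighted_lcs(a, b):
    """Total weight of the heaviest common subsequence of two token lists."""
    prev = [0.0] * (len(b) + 1)
    for x in a:
        curr = [0.0]
        for j, y in enumerate(b, start=1):
            best = max(prev[j], curr[j - 1])
            if x == y:
                best = max(best, prev[j - 1] + weight(x))
            curr.append(best)
        prev = curr
    return prev[-1]

def match_score(netflix_title, article_title):
    """Fraction of the Netflix title's weight matched, boosted for film articles."""
    n, a = tokens(netflix_title), tokens(article_title)
    total = sum(weight(t) for t in n)
    if total == 0:
        return 0.0
    overlap = weighted_lcs(n, a) / total
    bonus = sum(ARTICLE_BONUS.get(t, 0.0) for t in set(a))
    return overlap * (1.0 + bonus)

def best_match(netflix_title, article_titles):
    """Article title with the highest match score for a Netflix title."""
    return max(article_titles, key=lambda t: match_score(netflix_title, t))
```

For instance, `best_match("Titanic", ["RMS Titanic", "Titanic (1997 film)"])` prefers the article whose title contains "film". In practice a minimum score threshold would also be needed so that titles without any suitable article are left unmatched.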
¹ http://en.wikipedia.org/wiki/Wikipedia:Copyrights
² http://download.wikimedia.org (February 25, 2008)
2008 Seventh International Conference on Machine Learning and Applications
978-0-7695-3495-4/08 $25.00 © 2008 IEEE
DOI 10.1109/ICMLA.2008.121