Movie Recommendation System
Maike Heinrich
24 9 2021

Building a movie recommendation system based on machine learning

1. Introduction / overview

Harvard University launched a challenge to write the code for a movie recommendation system. Full points are awarded for final code that achieves an RMSE (root mean squared error) of less than 0.86490. Years ago, Netflix rewarded a team of data scientists with one million dollars after they achieved an RMSE of about 0.857.

The RMSE is a commonly used measure of the distance between the predicted values and the actual ones. It is the square root of the average of the squared differences between these values. So the formula looks like this:

RMSE = square root (average ((predicted values - actual values) squared))

So let's find out if it's possible to get the full points from Harvard University for this challenge.

Before writing an algorithm to predict ratings, the data set has to be explored and pre-processed. This includes visualizing some interesting facts that help to understand the approach that has to be used for writing the code. The data set is divided into an "edx" set, on which the machine learning algorithm is trained, and a "validation" set, which is only used to test the final code and to report the final RMSE. So it can't be touched while building the algorithm. Therefore the edx data has to be split again into two parts, a training set and a test set.

The data has to be cleaned of "noisy" data, which means for example users who rated only one or two movies, or movies that received only one rating. Taking such values fully into account is hardly a representative basis for a solid prediction, because we don't know whether such a user is very cranky or loves every movie he watches. Therefore regularization will be used: a penalty term is included that shrinks these uncertain estimates towards zero and so gives them less weight.
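The two ideas above, the RMSE and the shrinking of estimates that rest on very few ratings, can be sketched in a few lines. This is only an illustrative Python sketch, not the code from the script file: the function names `rmse` and `regularized_movie_effects`, the parameter `lam`, the `(movie_id, rating)` layout and the toy ratings are all assumptions made for this example. The penalized average used here, the sum of deviations from the overall mean divided by (number of ratings + `lam`), is one common way to shrink an estimate towards zero.

```python
import math
from collections import defaultdict


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating sequences."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def regularized_movie_effects(ratings, lam):
    """Per-movie deviation from the overall mean, shrunk towards zero.

    ratings: list of (movie_id, rating) pairs (hypothetical layout).
    lam: penalty strength; lam = 0 gives the plain per-movie average deviation.
    """
    mu = sum(r for _, r in ratings) / len(ratings)  # overall mean rating
    sums = defaultdict(float)   # sum of (rating - mu) per movie
    counts = defaultdict(int)   # number of ratings per movie
    for movie_id, r in ratings:
        sums[movie_id] += r - mu
        counts[movie_id] += 1
    # Dividing by (n + lam) instead of n penalizes movies with few ratings most.
    effects = {m: sums[m] / (counts[m] + lam) for m in sums}
    return mu, effects


# Toy data (made up): movie "A" has four ratings, movie "B" only one.
toy = [("A", 4.0), ("A", 4.0), ("A", 4.0), ("A", 4.0), ("B", 1.0)]
mu, plain = regularized_movie_effects(toy, lam=0.0)
_, shrunk = regularized_movie_effects(toy, lam=3.0)
```

With the same penalty, the estimate for the movie with a single rating is pulled towards zero much more strongly than the estimate for the movie with four ratings, which is the behaviour wanted for uncertain estimates.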
The regularization term will be calculated using cross-validation. Additionally, the approach here will be to modify the training data a bit by keeping only users who rated 20 or more times. This should improve the algorithm, as it will be more stable. At the end, the algorithm will be applied to the complete, unchanged edx set with all users and tested on the untouched validation set.

Some movies are similar, so there is a movie-to-movie effect; likewise, some users rate similarly and have similar preferences, which is the user-to-user effect. Both can be taken into account with matrix factorization. This method, and how regularization and cross-validation work, will be explained further in the methods / analysis part. So let's get started.

2. Methods / analysis (contains 3 parts)

Part 1: The basic code provided by Harvard University to build the algorithms on

(The provided code can be found in the script file but is not shown here.)