Personalized News Prediction and Recommendation

Abhishek Arora (arorabhi@stanford.edu), Dept. of Electrical Engineering, Stanford University
Preyas Shah (preyas@stanford.edu), Dept. of Mechanical Engineering, Stanford University

Abstract: Many web-based news applications (e.g. the Pulse news reader for iPhone/iPad and Android) gather news articles from the myriad ‘feeds’ a user is subscribed to and display them within each feed category. A feed is a stream of news articles that may represent a news category (e.g. Gadgets, Technology, Sports) or may be very general (e.g. an RSS feed of top articles provided by a newspaper). In this project, we analyze and develop machine learning techniques for personalized, feed-based news prediction and recommendation.

Keywords: Natural Language Processing, Information Retrieval, Machine Learning.

1.1 Introduction: Of all the things people read these days, news has a large effect on how they form opinions about different issues. Newspapers and news services may also influence the general mood of their readers by showing them more sad news or more happy news. Apart from opinion formation, one would also like to show the reader the kind of news she is most interested in. Unlike conventional newspapers, which are the same for all readers, the internet has made the flow of and access to information user-specific. Many news services, such as the Pulse application for iPhone/iPad, now provide a personalized experience by letting users subscribe to the feeds (news sources) they are most interested in. Most of these services, however, show the same set of news articles to every user within a particular feed. Thus, there is one level of personalization: each user sees articles only from the set of feeds she has subscribed to (Figure 1), while the articles within each feed remain the same for all users.
We propose a further level of personalization in which the news articles within each feed are also personalized according to the user's interests. To accomplish this, we apply natural language processing and information retrieval techniques to extract features from the news articles, and use these features in a machine learning setting to learn user behavior and interests.

2 Data

The dataset comes from a local startup, Alphonse Labs, the developers of the Pulse application. The data consists of the activities of 1068 users for the month of August 2011. We have three types of files available:

1) Stories: This consists of all the news articles (stories) available for reading, with the following information: article-url, article title, feed-url, feed title, timestamp of first read (a proxy for when the article appeared in the feed). There were about 165,000 unique stories and ~450 unique feeds. A story is identified by the (article-url, feed-url) tuple.

2) UserActivity: This consists of the stories the users have read, with the following information: user id, story url, story title, feed url, feed title, timestamp of when the user read the story.

3) ClickThroughs: This consists of the stories that users read in web mode. Users usually see stories in text mode in the app; a click-through event is defined as clicking a link to view a story in web mode, which may indicate a higher user interest in that story. Each entry in this file has the following format: user id, story url, story title, feed url, feed title, timestamp.

Figure 1: A sample snapshot of the Pulse application.

2.1 Curing the data: The first version of the dataset provided didn't have the timestamp information for the 'Stories' file. Thus, we had no information about when a story became available. This