Implicitly Learning a User Interest Profile for Personalization of Web Search using Collaborative Filtering

Ashish Nanda
Dept. of Computer Science, BITS-Pilani, Goa Campus, Zuarinagar, India
Email: f2010175@goa.bits-pilani.ac.in

Rohit Omanwar
Dept. of Computer Science, BITS-Pilani, Goa Campus, Zuarinagar, India
Email: h2012060@goa.bits-pilani.ac.in

Bharat Deshpande
Head of Dept. of Computer Science, BITS-Pilani, Goa Campus, Zuarinagar, India
Email: bmd@goa.bits-pilani.ac.in

Abstract—The increasing abundance of content on the web has made information filtering even more important in helping users find information related to their interests. Personalization of web search is one such effort that aims at improving the efficiency with which a user finds results relevant to his query. This is done by keeping track of a user’s individual interests and taking them into account while returning search results. We propose a robust user modeling technique that implicitly creates a Dynamic Category Interest Tree (DCIT), using a general ontology of the web and a set of web pages collected over time that give an insight into a user’s interests. The DCIT is designed to use a fuzzy classification technique to keep track of which topics a user is interested in and his amount of interest in each topic, and to reflect his changing interests over time. The DCIT consists of a general ontology of the web, where each node represents a topic and consists of keywords that are usually used to describe that topic or category. Additional keywords that the user frequently associates with a topic, such as names of important people, organizations, or specialized terminology, are also incorporated into the relevant topic. We use the Apriori algorithm to extract these associated words from the user’s web history in order to define the user’s categories of interest more accurately.
The DCIT is initially created by a content-based approach using only the browsing history of the user, and is later enhanced through collaborative filtering using a k-nearest-neighbour algorithm. We propose a technique to re-rank the results from a search engine according to their relevance to a user, based on his implicitly learned DCIT. According to experimental results, our DCIT-based ranking often outperforms search engines such as Google when it comes to retrieving web pages that are more relevant to a user’s interests.

Keywords—personalized web search; ranking; user profile; implicit user interest

I. INTRODUCTION

The World Wide Web is a great source of information for millions of users and has content spanning almost all topics at various abstraction levels. While this allows it to serve as a huge information resource, the diversity and sheer volume of available information often make it difficult for users to find the web pages most relevant to them at any point in time. Users have different and specific interests, varying levels of proficiency in each topic, and different requirements for the level of detail on each topic. Search engines and several web applications are often built to serve all users in the same way, with little or no adaptation to the user’s profile, namely their interests, preferences, and past behavior while using the application. Thus, if a technology enthusiast types the word “Apple” into a search engine, he would probably expect results about the popular technology company Apple Inc., while a farmer would probably be more interested in results pertaining to the fruit.

Therefore, in order to tackle the problem of recommending relevant results to a user based on their interest profile across various topics, we propose a user modeling technique that creates a Dynamic Category Interest Tree, with the most general topics at the top of the tree and more specific subtopics at deeper levels in the tree.
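To make the idea of a category tree with graded, time-varying interest more concrete, the following is a minimal Python sketch. It is an illustration under stated assumptions, not the scoring used in this paper: the names `TopicNode`, `membership`, `observe_page`, `score_page`, and `rerank` are hypothetical, the keyword-overlap membership and the exponential `decay` factor are simple stand-ins for the fuzzy classification and interest update, and the two-topic tree is a toy example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TopicNode:
    """One topic in a toy category tree (hypothetical structure)."""
    name: str
    keywords: set[str]
    children: list[TopicNode] = field(default_factory=list)
    interest: float = 0.0  # accumulated, decayed interest in this topic

    def walk(self):
        """Yield this node and all of its descendants."""
        yield self
        for child in self.children:
            yield from child.walk()


def membership(node: TopicNode, words: set[str]) -> float:
    """Assumed fuzzy membership: fraction of the topic's keywords on the page."""
    if not node.keywords:
        return 0.0
    return len(node.keywords & words) / len(node.keywords)


def observe_page(root: TopicNode, text: str, decay: float = 0.9) -> None:
    """Update every topic after a visited page; old interest fades by `decay`."""
    words = set(text.lower().split())
    for node in root.walk():
        node.interest = decay * node.interest + membership(node, words)


def score_page(root: TopicNode, text: str) -> float:
    """Score a candidate result by interest-weighted topic membership."""
    words = set(text.lower().split())
    return sum(n.interest * membership(n, words) for n in root.walk())


def rerank(root: TopicNode, results: list[str]) -> list[str]:
    """Reorder result texts so that higher-scoring pages come first."""
    return sorted(results, key=lambda t: score_page(root, t), reverse=True)


# Toy tree: a general root with two more specific subtopics.
tree = TopicNode("Top", set(), [
    TopicNode("Technology", {"iphone", "software", "company"}),
    TopicNode("Agriculture", {"fruit", "orchard", "harvest"}),
])
observe_page(tree, "new iphone software released by the company")
ranked = rerank(tree, ["apple orchard harvest season",
                       "apple iphone software update"])
```

In this toy run, the user has only visited a technology page, so the result mentioning “iphone” and “software” is ranked ahead of the orchard page for the ambiguous term “apple”. Because each update multiplies the old interest by `decay`, topics that stop appearing in the browsing history gradually lose weight.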
The Dynamic Category Interest Tree is designed not only to take into account the different topics a user is interested in, together with their general and user-specific features and terms, but also to reflect the changing interests of the user over time. The user profile is first created through content-based personalization by keeping track of a user’s browsing patterns, and is later enriched through collaborative filtering. We use this user profiling technique to filter the search results returned by Google for a user’s query and to re-rank the results based on the user’s profile. The text of a page is compared to the user’s profile, and a score is calculated from several ranking parameters; this score is used to reorder the search results.

The main merit of this technique is that it employs a fuzzy classification when scoring a user’s topics of interest. These scores change over time as the user’s interests change, and hence the re-ranked results reflect both the long-term and short-term interests of a user. For example, when a user who is interested in “tennis” and “football” types the query “latest sports news”, news related to these sports will be ranked the highest. Similarly, if a user is reading web pages on “Operating Systems” in the first half of a year, then a query like “upcoming IEEE conferences” would return conferences on topics related to Operating Systems as the top-ranked results. If the user’s interest changes to “Machine Learning” in the next few months, the same query will then rank pages related to conferences on Machine Learning higher.

II. RELATED WORK

A. Search Personalization

Personalized web search was first proposed by Page et al.
[1] by using a modified page rank algorithm which took
2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT) 978-1-4799-4143-8/14 $31.00 © 2014 IEEE DOI 10.1109/WI-IAT.2014.80 54