Mining frequent generalized patterns for Web personalization

Panagiotis Giannikopoulos, Iraklis Varlamis, Magdalini Eirinaki

University of Peloponnese, Department of Computer Science and Technology, Tripoli, Greece, cst04006@uop.gr
Athens University of Economics and Business, Department of Informatics, Athens, Greece, varlamis@aueb.gr
San Jose State University, Computer Engineering Department, San Jose, CA, US, magdalini.eirinaki@sjsu.edu

Abstract. In this paper we present FGP, an algorithm that combines the strengths of an association rule mining algorithm (FP-Growth) and a generalized pattern mining algorithm (GP-Close) in order to efficiently generate rules from transaction data. Our Frequent Generalized Pattern (FGP) algorithm assumes that all items that appear in a set of transactions belong to categories organized in a taxonomy. It takes as input the transaction database and the taxonomy of categories and produces generalized association rules that contain transaction items and/or item categories. This algorithm is particularly useful for personalizing web sites with continuously updated content, such as blog aggregators or news portals. In this context, the transaction database contains user click-stream information and the hierarchy of item types is a thematic taxonomy of web pages. The algorithm generates frequent itemsets comprising both web pages and categories. The results are used to generate association rules and, consequently, recommendations for the users. We experimentally evaluate the proposed algorithm using web log data collected from a newspaper web site.

1. INTRODUCTION

Recommendations play a very important role in everyday transactions. When buying a product or reading a newspaper article, one would like to have recommendations on related items. To achieve this, recommendation engines first build a predictive model by discovering itemsets or item sequences with high support among users.
Recommendations are subsequently generated by matching new transaction patterns to the predictive model. Most current approaches in web personalization assume that a web site consists of a finite number of web pages and build their predictive models based on this assumption [10]. The Web, however, is a continuously evolving environment, and this assumption no longer holds. Social networking structures, such as blog aggregators, and news portals are typical examples of this situation, since their content is updated on a regular basis. As a result, the traditional, usage-based approach that takes as input the navigation paths recorded at the web page level is not as effective. Since most predictive models are based on frequent itemsets, the more recent a page is, the more difficult it is for it to become part of the recommendation set; at the same time, such pages are more likely to be of interest to the average user. This problem can be addressed by generalizing the page-level navigation patterns to a higher, aggregate level [3, 9].

In this work, we address the aforementioned problem by modifying and combining two algorithms that have been proposed in different contexts. The first algorithm, FP-Growth [5], takes as input a database of user transactions, each comprising one or more unordered items (an itemset), and a minimum support threshold. The algorithm processes the transaction database and mines the complete set of frequent itemsets (those whose frequency surpasses the threshold). FP-Growth considers the support of each item in the set to be equal to one. We extend the algorithm so that it assigns a different weight to every item in the set, depending on its importance in the transaction. The algorithm assumes no relation between items in the database, but this is not the case on the Web, where the items in a web site are (conceptually) hierarchically organized. This characteristic is tackled by the second algorithm, GP-Close [6, 7], which was proposed independently of FP-Growth.
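
To make the notion of weighted support concrete, the following minimal sketch counts support over transactions whose items carry weights. It is not FP-Growth and not the FGP algorithm: it enumerates candidate itemsets by brute force purely for illustration. The aggregation rule (an itemset contributes the smallest weight among its items in each transaction that contains it), the function names `weighted_support` and `frequent_itemsets`, the page identifiers, and the weight values are all assumptions made for this example and are not taken from the algorithms cited above. When every weight equals one, the measure reduces to the ordinary support count.

```python
from itertools import combinations


def weighted_support(transactions, itemset):
    """Weighted support of `itemset` over `transactions`.

    Each transaction is a dict mapping an item to its weight.
    ASSUMPTION: a transaction that contains every item of the itemset
    contributes the smallest of those items' weights.
    """
    total = 0.0
    for weights in transactions:
        if all(item in weights for item in itemset):
            total += min(weights[item] for item in itemset)
    return total


def frequent_itemsets(transactions, min_support, max_size=3):
    """Brute-force enumeration (NOT FP-Growth) of itemsets whose
    weighted support reaches `min_support`."""
    items = sorted({item for t in transactions for item in t})
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            support = weighted_support(transactions, candidate)
            if support >= min_support:
                result[candidate] = support
    return result


# Hypothetical click-stream sessions: page -> importance weight.
sessions = [
    {"/politics/a1": 1.0, "/sports/b1": 0.5},
    {"/politics/a1": 0.5, "/sports/b1": 1.0, "/world/c1": 1.0},
    {"/politics/a1": 1.0, "/world/c1": 0.5},
]

if __name__ == "__main__":
    for itemset, support in frequent_itemsets(sessions, min_support=1.0).items():
        print(itemset, support)
```

Other aggregation rules (for example, the average or the product of the item weights) would fit the same interface; the minimum is used here only because it keeps support anti-monotone, so a superset can never have higher support than any of its subsets.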
GP-Close considers a hierarchical organization of all items in the transaction database and uses this information to produce generalized patterns. The two algorithms are very efficient and solve many of the problems of pattern mining, such as the costly generation of candidate sets and the over-generalization of rules.

In this paper, we combine the strengths of the two algorithms in one efficient generalized pattern mining algorithm, which: a) extends the main structure of FP-Growth, the FP-Tree, to include weight information about items, thus producing a weighted FP-Tree (WFP-Tree), and b) addresses the problem of continuously updated content by using the WFP-Tree and the taxonomic information about a web site's content as input to the GP-Close algorithm, and generates generalized recommendations. We experimentally evaluate our approach using web log data and content collected from a newspaper's web site.

The paper is organized as follows. First, we provide an overview of the related research in the area of pattern and association rule mining, as well as in the area of personalizing news sites. We briefly describe the fundamentals of the FP-Growth and GP-Close algorithms, and we present the details of the FGP algorithm, in Sections 3 and 4 respectively. In Section 5, we discuss a proof-of-concept implementation and present preliminary experimental results. We conclude with our plans for future work in Section 6.

2. RELATED WORK

Numerous approaches exist that address the problem of personalizing a web site. An extensive overview can be found in [10]. Here we review those that generalize the predicted patterns using a hierarchy. The problem with sites such as blogs or news portals is that their content is continuously updated. Moreover, in the case of blog aggregators we have less control over the tags assigned to each item. Since they do not belong to a hierarchy, we need to put extra effort into assigning them to a hierarchy node (i.e. using