Mining Web Usage and Content Structure Data to Improve Web Cache Performance in Content Aggregation Systems

Carlos Guerrero
Department of Mathematics and Computer Science, University of Balearic Island, Palma, E-07122, Spain
Email: carlos.guerrero@uib.es

Carlos Juiz
Department of Mathematics and Computer Science, University of Balearic Island, Palma, E-07122, Spain
Email: cjuiz@uib.es

Ramon Puigjaner
Department of Mathematics and Computer Science, University of Balearic Island, Palma, E-07122, Spain
Email: putxi@uib.cat

Abstract-Web cache performance has been reduced in Web 2.0 applications due to higher content update rates and a larger number of personalized web pages. This problem can be mitigated by caching content fragments instead of complete web pages. We propose a classification algorithm to determine the fragment design that achieves the best performance. To create the algorithm, we have mined data on content characterization, user behaviour and performance. We have obtained two classification trees as a result of this process. These classification trees are used to determine the fragment design. We have optimized the model of a real web site using both classification trees and we have evaluated the user-observed response time. The results show that optimizing the fragment design can achieve high speedups in the user-perceived response time.

Keywords-Web caching; classification trees; web performance engineering; web content aggregation.

I. INTRODUCTION

Web caching is a widely used technique to save bandwidth, to reduce server workload and to improve user response time, i.e., web caching improves the performance of web architectures. This improvement is based on the reusability of web responses across different users and requests. Reuse happens when several users request the same page, or when a user requests the same web page at least twice before its content changes.
In current web architectures, especially in Web 2.0 systems, content changes are more frequent and web pages can be personalized. The behaviour of user-generated content and of pages created by collaboration is more unpredictable [1]. As a result, the web responses stored in the web cache are no longer reusable. It is widely accepted that this problem can be solved by reducing the minimum cacheable unit: content fragments instead of whole web pages [2], [3]. Nevertheless, from a performance point of view, there is a dilemma [4]: on one hand, a high level of fragmentation (a large number of content fragments) improves the hit ratio, but response time could increase due to connection overhead and fragment joining times; on the other hand, a low level of fragmentation (a small number of fragments) minimizes overhead times but worsens the hit ratio. The problem is therefore to determine when it is better to serve two fragments together (joined) and when it is better to serve them separately (split).

To deal with this problem, we use a tree-based classification algorithm which optimizes the performance of the system. This algorithm uses the characteristics of the contents of the page (content fragment sizes, update rates, request rates, ...) to obtain a design of the web pages (which elements are served split and which ones are served joined) for optimal performance.

We have compared the performance of the joined and split states over a large number of synthetic pages. We have applied data mining algorithms to the data obtained in this exploration phase and we have produced different classification trees. These classification trees have been tested on a web page model extracted from a real system (The New York Times web site, http://www.nytimes.com). The performance results show high speedups over the two basic design alternatives: either all the fragments are joined or all the fragments are split.
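The joined-versus-split dilemma described above can be illustrated with a small numerical sketch. This is not the model used in the paper: it is a toy approximation that assumes independent Poisson requests and updates, so that a cached unit is hit with probability r / (r + u) for request rate r and update rate u, and it ignores the cost of serving a hit. The names `Fragment`, `hit_ratio`, `joined_time`, `split_time` and `best_state`, as well as the `overhead` and `join_cost` parameters, are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """Hypothetical fragment description; field names are illustrative."""
    update_rate: float      # content updates per second
    generation_time: float  # seconds needed to regenerate on a cache miss


def hit_ratio(request_rate: float, update_rate: float) -> float:
    """Chance that a request finds a still-valid cached copy.

    Assumes independent Poisson requests and updates (toy assumption).
    """
    return request_rate / (request_rate + update_rate)


def joined_time(a: Fragment, b: Fragment, request_rate: float,
                overhead: float) -> float:
    """Expected response time when both fragments form one cached unit."""
    # The joined unit is invalidated whenever either fragment changes.
    h = hit_ratio(request_rate, a.update_rate + b.update_rate)
    miss_cost = a.generation_time + b.generation_time
    return overhead + (1.0 - h) * miss_cost


def split_time(a: Fragment, b: Fragment, request_rate: float,
               overhead: float, join_cost: float) -> float:
    """Expected response time when each fragment is cached separately."""
    total = join_cost  # assembling the page from its fragments
    for fragment in (a, b):
        h = hit_ratio(request_rate, fragment.update_rate)
        # Each fragment pays its own connection overhead.
        total += overhead + (1.0 - h) * fragment.generation_time
    return total


def best_state(a: Fragment, b: Fragment, request_rate: float,
               overhead: float, join_cost: float) -> str:
    """Return 'joined' or 'split', whichever has the lower expected time."""
    joined = joined_time(a, b, request_rate, overhead)
    split = split_time(a, b, request_rate, overhead, join_cost)
    return "joined" if joined <= split else "split"


if __name__ == "__main__":
    stable = Fragment(update_rate=0.001, generation_time=0.2)
    volatile = Fragment(update_rate=5.0, generation_time=0.05)
    print(best_state(stable, volatile, request_rate=10.0,
                     overhead=0.005, join_cost=0.002))
```

Under these assumed numbers, a rarely updated but expensive fragment paired with a frequently updated one favours the split state, because joining them lets the volatile fragment invalidate the expensive one. Two cheap, rarely updated fragments favour the joined state, since the extra connection and joining overhead outweighs the small gain in hit ratio.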
Our main contribution is the development of a classification algorithm which improves the performance of systems in which web pages are created by the aggregation of atomic content fragments. This solution can be applied to systems in which these fragments can be served joined or split. We represent these two ways of delivering the content fragments by a state of the aggregation relationship. The performance of the system changes depending on the state of each aggregation relationship. The inputs of the algorithm are the characteristics of the fragments.

In Section III, we give the details of how the content aggregation application works and of the model we use to represent the content fragments, the web pages, the characterization parameters of the fragments and the states of the aggregation relationships. In Section IV, we explain in which types of applications our proposal can be used and how it fits into these applications. We explain the process

ICDT 2011: The Sixth International Conference on Digital Telecommunications. Copyright (c) IARIA, 2011. ISBN: 978-1-61208-127-4