Detecting and Representing Relevant Web Deltas using Web Join

SOURAV S. BHOWMICK   SANJAY MADRIA   WEE KEONG NG   EE-PENG LIM

Centre for Advanced Information Systems, School of Applied Science, Nanyang Technological University, Singapore 639798
{p517026, awkng, aseplim}@ntu.edu.sg

Department of Computer Science, Purdue University, West Lafayette, IN 47907
skm@cs.purdue.edu

Abstract

In this paper, we show how to detect and represent web deltas, i.e., changes in Web information, that are relevant to a user’s query in the context of our web warehousing system called WHOWEDA (Warehouse of Web Data). In WHOWEDA, Web information is held as materialized views stored in web tables and can be manipulated and analyzed using a set of web algebraic operators. We present a mechanism to detect relevant web deltas using web join and outer web join, and show how to represent these changes using delta web tables.

Keywords: web deltas, web warehousing, web tables, web join, outer web join.

1 Introduction

Detecting changes to Web data is a challenging problem because information sources on the Web are autonomous, and typical database approaches that detect changes through triggering mechanisms are not usable [4]. Consider the following scenario.

Example 1 Assume that there is a Web site at http://www.panacea.gov/ which provides information related to drugs used for various diseases. The Web page at www.panacea.gov (denoted by ) contains a list of diseases. From this list, the link for a particular disease points to a web page (denoted by , , etc. for various drugs) containing a list of drugs used for prevention of the disease. From the hyperlink associated with each drug, one can probe further to find a document (denoted by , etc.) containing a list of various issues related to that drug, such as “description”, “uses”, “side-effects”, etc. From the hyperlink associated with each issue, one can retrieve the details of that issue for the drug.
Let us consider some modifications to this Web site, as shown in Figures 1 and 2. These figures depict the structure of the Web site as of 15th January, 2000 and February, 2000 respectively. Note that the black boxes, patterned boxes and grey boxes in these figures depict the addition of new documents, modification of existing documents and deletion of existing documents respectively. Furthermore, the dashed dotted arrows indicate addition, deletion or modification of hyperlinks.

Figure 1. Web site on 15th January, 2000.

Suppose that on 15th January, 2000, a user wishes to find out periodically (say every 30 days) information related to the side effects and uses of drugs for various diseases, together with changes to this information compared to its previous version. This query requires access to previous states of the Web site and a mechanism to detect these changes automatically, features that are not supported by the Web or by existing search engines. Thus, we need a mechanism to compute and represent changes in the context of Web data.

Although there is an increasing research effort on querying the Web [6], there is very little work on change detection and representation of Web data. The AT&T Internet Differ-