-Web Join in A Web Warehouse

Sourav S Bhowmick, Sanjay K Madria, Wee-Keong Ng, Ee-Peng Lim
Centre for Advanced Information Systems
Nanyang Technological University
Singapore 639798
{p517026, askumar, awkng, aseplim}@ntu.edu.sg

This work was supported in part by the Nanyang Technological University, Ministry of Education (Singapore) under Academic Research Fund #4-12034-5060, #4-12034-3012, #4-12034-6022. Any opinions, findings, and recommendations in this paper are those of the authors and do not reflect the views of the funding agencies.

Abstract

With the enormous amount of data stored in the World Wide Web, it is increasingly important to design and develop powerful web warehousing tools. The key objective of our web warehousing project, called WHOWEDA (Warehouse of Web Data), is to design and implement a web warehouse that materializes and manages useful information from the Web. In this paper, we introduce the concept of -web join in the context of WHOWEDA. The -web join operator is a web information manipulation operator that combines relevant web information residing in two web tables. Informally, it is a combination of the web join and web project operators that filters out irrelevant information from a joined web table. We show how to construct the -joined web table and its schema. We also highlight the benefits of the -web join operator.

1. Introduction

Currently, web information may be discovered primarily by two mechanisms: browsers and search engines. This form of information access on the Web has a few shortcomings [7]. To resolve these limitations, we introduced the Web Information Coupling System (WICS) [7], a database system for managing and manipulating coupled information extracted from the Web. WICS is one of the components of our web warehouse, called WHOWEDA (Warehouse of Web Data) [1]. In WICS, we materialize web information as web tuples and store them in web tables. We equip WICS with the basic capability to manipulate web tables and correlate additional, useful, related web information residing in the web tables [14]. Note that a web table is a collection of directed graphs (i.e., web tuples).

We introduced the web join operator in [8, 14] as a web information manipulation operator in WICS. The web join operator couples related information from two web tables by concatenating a web tuple of one table with a web tuple of the other table whenever there exist instances of joinable node variables (identical nodes).

1.1. Motivation

Example 1. Assume a web site at http://www.panacea.org/ that stores disease- and drug-related information. Suppose we have the following two web tables in our warehouse, constructed by coupling related information from the web site at http://www.panacea.org/:

1. Web table Diseases stores a list of diseases and their symptoms, treatments and evaluation details. Figures 1 and 3 depict the web schema of Diseases and a partial view of the web table respectively (see footnote 1). Note that the web schema is also called the query graph and is explicitly specified by the user in order to couple information from the Web.

2. Web table Drugs stores a list of drugs for various diseases and their side effects and uses. Figures 2 and 4 describe the web schema of Drugs and a partial view of the web table respectively.

Suppose a user wants to extract information related to symptoms of various diseases and side effects of drugs used for these diseases using WICS. Clearly, this information is already stored in tables Diseases and Drugs. The web join operator enables us to relate the information from the

Footnote 1: Note that in all figures in this paper, the boxes and directed lines correspond to nodes (Web documents) and links (hyperlinks) respectively. Observe that some of the nodes and links have keywords imposed on them. These keywords express the content of the web document or the label of the hyperlink between the web documents.
The dashed arrows signify the existence of unbound node and/or link variables