Verifying Integrity Constraints on Web Sites

Mary Fernandez, AT&T Research, 180 Park Ave., Florham Park, NJ 07932, USA, mff@research.att.com
Daniela Florescu, INRIA, B.P. 105 Rocquencourt, Le Chesnay cedex, France, dana@rodin.inria.fr
Alon Levy, Dept. of Computer Science, University of Washington, Seattle, WA 98195, USA, alon@cs.washington.edu
Dan Suciu, AT&T Research, 180 Park Ave., Florham Park, NJ 07932, USA, suciu@research.att.com

Abstract

Data-intensive Web sites have created a new form of knowledge base, as richly structured bodies of data. Several novel systems for creating data-intensive Web sites support declarative specification of a site's structure and content (i.e., the pages, the data available in each page, and the links between pages). Declarative systems provide a platform on which AI techniques can be developed that further simplify the tasks of constructing and maintaining Web sites. This paper addresses the problem of specifying and verifying integrity constraints on a Web site's structure. We describe a language that can capture many practical constraints and an accompanying sound and complete verification algorithm. The algorithm has the important property that if the constraints are violated, it proposes fixes to either the constraints or to the site definition. Finally, we establish tight bounds on the complexity of the verification problem we consider.

1 Introduction

Data-intensive Web sites have created a new form of knowledge base. They typically contain and integrate several bodies of data about the enterprise they are describing, and these bodies of data are linked into a rich structure. For example, a company's internal Web site may contain data about its employees, linked to data about the products they produce and/or to the customers they serve. The data in a Web site and the structure of the links in the site can be viewed as a richly structured knowledge base.

The management of data-intensive Web sites has received significant attention in the database community [Fernandez et al., 1998; Atzeni et al., 1998; Arocena and Mendelzon, 1998; Cluet et al., 1998; Paolini and Fraternali, 1998]. The key insight of recent systems is to specify the structure and content of sites declaratively. These systems separate and provide direct support for the three primary steps of site creation: (1) identifying and accessing the data served at the site, (2) defining the site's structure (i.e., the pages, the data in each page, and the links between pages), and (3) specifying the HTML rendering of the site's pages. Step 2 is usually supported by a declarative specification language.

Web-site management systems based on declarative representations offer several benefits. First, since a site's structure and content are defined declaratively, not procedurally by a program, it is easy to create multiple versions of a site. For example, it is possible to build internal and external views of an organization's site or to build sites tailored to novice or expert users. Currently, creating multiple versions requires writing multiple sets of programs or manually creating different sets of HTML files. Second, these systems support the evolution of a site's structure. For example, to reorganize pages based on frequent usage patterns or to extend the site's content, we simply rewrite the site's specification. Another advantage is efficient update of a site when its data sources change.

Declarative Web-site management systems also allow us to view a site's definition and its content as a knowledge base. A natural next step is to consider how reasoning techniques can further improve the process of building and maintaining Web sites. We consider the reasoning problem of verifying integrity constraints over Web sites.
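As a concrete illustration of a site definition and of a structural constraint over it, the following minimal Python sketch derives a small site graph from data and checks, on one generated instance, that every page is reachable from the root. The data, the page-naming scheme, and the functions `build_site` and `unreachable_pages` are hypothetical, chosen only for illustration; they are not the specification language or the verification algorithm of this paper. The check runs on a single instance and so says nothing about other instances the same definition could produce.

```python
from collections import deque

# Hypothetical data source: employees and the products they work on.
EMPLOYEES = [
    {"name": "ann", "product": "router"},
    {"name": "bob", "product": "switch"},
]


def build_site(employees, link_products=True):
    """Derive a site graph (page -> set of linked pages) from the data.

    The rules below stand in for a declarative site definition:
    one root page, one page per employee, one page per product.
    """
    site = {"root": set()}
    for e in employees:
        emp_page = "emp:" + e["name"]
        prod_page = "prod:" + e["product"]
        site["root"].add(emp_page)
        site.setdefault(emp_page, set())
        site.setdefault(prod_page, set())
        if link_products:
            site[emp_page].add(prod_page)
    return site


def unreachable_pages(site, root="root"):
    """Return the pages of one site instance not reachable from the root."""
    seen = {root}
    queue = deque([root])
    while queue:  # breadth-first traversal over the links
        page = queue.popleft()
        for target in site.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return set(site) - seen


full = build_site(EMPLOYEES)
broken = build_site(EMPLOYEES, link_products=False)
print(unreachable_pages(full))            # set()
print(sorted(unreachable_pages(broken)))  # ['prod:router', 'prod:switch']
```

Changing one rule in the hypothetical definition (dropping the employee-to-product links) leaves the product pages orphaned, which is the kind of violation a designer would want reported.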
Specifically, when the structure of a site becomes complex, it is hard for a designer to ensure that the site will satisfy a set of desired properties. For example, we may want to enforce that all pages are reachable from the root, that every organization homepage points to the homepages of its sub-organizations, or that proprietary data is not displayed on the external version of the site. A study on the usability of on-line stores [Lohse and Spiller, 1998] provides other constraints that, if followed, would improve the site design.

For a verification tool to be useful, it must verify constraints against a site definition, not a particular instance of the site, because (1) we do not want to verify the constraints every time the site instance changes, and (2) if a Web site is dynamically generated, an instance is never completely materialized, making it impossible to check the constraints. Verifying the constraints on the site definition ensures that as long as the site is generated according to the definition, the constraints will be satisfied. For this reason, the verification problem requires reasoning, and not just applying a procedure to the site. Furthermore, when the integrity constraints are not ver-
614 KNOWLEDGE-BASED APPLICATIONS