System RX: One Part Relational, One Part XML

Kevin Beyer¹, Roberta J. Cochrane¹, Vanja Josifovski¹, Jim Kleewein², George Lapis², Guy Lohman¹, Bob Lyle², Fatma Özcan¹, Hamid Pirahesh¹, Normen Seemann², Tuong Truong², Bert Van der Linden², Brian Vickery², Chun Zhang¹

¹ IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120
² IBM Silicon Valley Lab, 555 Bailey Road, San Jose, CA 95141

Abstract

This paper describes the overall architecture and design aspects of a hybrid relational and XML database system called System RX. We believe that such a system is fundamental in the evolution of enterprise data management solutions: XML and relational data will co-exist and complement each other in enterprise solutions. Furthermore, a successful XML repository requires much of the same infrastructure that already exists in a relational database management system. Finally, XML query languages have considerable conceptual and functional overlap with relational dataflow engines. System RX is the first truly hybrid system that co-mingles XML and relational data, giving them equal footing. The new support for XML includes native support for storage and indexing as well as query compilation and evaluation support for the latest industry-standard query languages, SQL/XML and XQuery. By building a hybrid system, we leverage more than 20 years of data management research to advance XML technology to the same standards expected from mature relational systems.

1. Introduction

XML first became a W3C recommendation in February 1998, as a standard way to delimit text data [42]. It has emerged in the industry as the predominant mechanism for representing and exchanging structured and semi-structured information across the Internet, between applications, and within an intranet. Virtually every industry is working to standardize XML representations for their common business objects.
As one industry analyst put it, "Hundreds of vertical schemas, in fields as diverse as government, biology, finance, and travel, are publicly available and being actively used. Undoubtedly, there are thousands more in private hands" [5].

With the advent of Web services and services-oriented architectures, it is quite common for intra-company and inter-company interactions to be processed via XML messages. In such cases, the message is more than the transaction request; it is also a business artifact: a purchase order, an order inquiry, an invoice, etc. Such messages need to be retained for many reasons including auditing, regulatory compliance, and non-repudiation. For example, a large securities clearing house interacting with member brokers using Web services is legally obliged to store the XML messages for non-repudiation. Many of these uses also require extensive search capabilities, and the XML storage must have very high fidelity to preserve digital signatures as required for non-repudiation. So, although XML’s original intent was data interchange, an increasing amount of XML is designed to be persistently stored, and enterprises are even persisting XML messages primarily used for data interchange for later analysis.

A large percentage of industries rely heavily on existing relational databases and applications to run their businesses, from which much of the information within the XML document is generated, or into which much of the information from the XML documents will be stored. We believe that the integration of this well-structured relational information with the self-describing XML data is an important evolutionary advance in the data industry.

This paper describes the overall architecture and design aspects of a hybrid relational and XML database system called System RX.
The system understands both relational and XML data deeply, with new support for XML throughout the system, including native support for storage and indexing, as well as query compilation and evaluation support for the latest industry-standard query languages, SQL/XML and XQuery. System RX is an experimental prototype that is currently being implemented as an extension to DB2 UDB. This paper describes the overall architecture and the design of the system. Later papers will describe major subsystems in more detail.

There are three driving factors that led us to build a hybrid relational and XML database system:

(1) XML and relational data will co-exist and complement each other in enterprise solutions. Some types of data are best modeled and stored in a relational format, but other types are best suited for XML. Although the data could be normalized into relational tables, it may not be appropriate to do so. There are many examples of this.

(a) The data comes from a multiplicity of schemas, and the aggregate size of the relational schema needed to model all the data is unacceptable given the usage. For example, an organization with 1,500 e-forms required over 30,000 relational tables to represent its data, despite the fact that most forms are seldom used.

(b) The data has a schema that varies highly over time. We refer to such schemas as dynamic schemas. The impact of frequently changing the corresponding relational schema makes it impractical to model the data in relations. This is particularly pronounced when the corresponding schema change would require normalization, such as turning a single-valued attribute into a multi-valued one.

(c) The data contains many sparse attributes that are only accessed in the context of the parent object. Thus, the cost of normalization is prohibitive, and de-normalization is impractical because of limits on the maximum width of a row or the maximum number of columns in a table.
Hence, there is a need to persist and search XML natively alongside relational data.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGMOD 2005, June 14–16, 2005, Baltimore, Maryland, USA.
Copyright 2005 ACM 1-59593-060-4/05/06 $5.00.