A Web-Services Architecture for Efficient XML Data Exchange

Sihem Amer-Yahia
AT&T Labs–Research, 180 Park Ave, Florham Park, NJ 07932
sihem@research.att.com

Yannis Kotidis
AT&T Labs–Research, 180 Park Ave, Florham Park, NJ 07932
kotidis@research.att.com

Abstract

Business applications often exchange large amounts of enterprise data stored in legacy systems. The advent of XML as a standard specification format has improved application interoperability. However, optimizing the performance of XML data exchange, in particular when data volumes are large, is still in its infancy. Quite often, the target system has to undo some of the work the source did to assemble documents in order to map XML elements into its own data structures. This publish&map process is both resource and time consuming.

In this paper, we develop a middle-tier Web services architecture to optimize the exchange of large XML data volumes. The key idea is to allow systems to negotiate the data exchange process using an extension to WSDL. The source (target) can specify document fragments that it is willing to produce (consume). Given these fragmentations, the middleware instruments the data exchange process between the two systems to minimize the number of necessary operations and optimize the distributed processing between the source and the target systems. We show that our new exchange paradigm outperforms publish&map and enables more flexible scenarios without necessitating substantial modifications to the underlying systems.

1. Introduction

Large organizations use a plethora of systems to support their daily operations. Depending on the application, a system may act as a data broker by disseminating information that is consumed by the receiving applications. For instance, in a telecom provider like AT&T, a sales and ordering system provides an interface to extract data on customer orders.
This data is used to drive a provisioning process that implements changes to the physical network in order to support the line features requested by customers. Finally, this data, along with usage information generated from the operation centers, is consumed by a biller to set up customer accounts in order to collect revenue. In such real-world applications, the data that is exchanged can reach very large volumes. As an example, usage data from the telephony network easily exceeds 60GB per day. In this paper, we focus on the optimization of data exchange between two applications when the amount of data is large.

In order to collaborate, applications implement pair-wise agreements that define the format of the data to exchange. To this end, XML is most commonly used. Web services are typical examples that use XML as the grammar for describing services on the network as a collection of systems capable of exchanging data and messages. The specification of a Web Service Description Language (WSDL) document hides the details involved in data communication by focusing on the format in which that data is being produced and consumed and the services that are provided at each endpoint [12]. However, optimizing the performance of exchanging large data volumes has not attracted much attention. Quite often, the target system has to undo some of the work the source did to assemble documents in order to map XML elements into its own data structures. This process is both resource and time consuming.

In a typical data exchange scenario, which we will refer to as publish&map, XML documents are built at a source application and shipped to be consumed at a target one. The process of publishing an XML document from stored data often translates to costly combine operations (through joins, in the case of relational stores) that piece document fragments together.
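To make the cost structure of publish&map concrete, the following self-contained sketch illustrates both halves of the process on a hypothetical customer/order schema (the table and element names are our own illustrative assumptions, not the paper's actual systems): the source joins relational fragments to publish an XML document, and the target then undoes that work, splitting the document back into tuples for its own store.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical source schema: customers and their orders stored relationally.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer(cid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders(oid INTEGER PRIMARY KEY, cid INTEGER, item TEXT);
INSERT INTO customer VALUES (1, 'Alice');
INSERT INTO orders VALUES (10, 1, 'DSL line'), (11, 1, 'voicemail');
""")

# Publish: a join at the source pieces the document fragments together.
root = ET.Element("customers")
rows = con.execute("""
    SELECT c.cid, c.name, o.oid, o.item
    FROM customer c JOIN orders o ON c.cid = o.cid
    ORDER BY c.cid, o.oid
""")
cust_elems = {}
for cid, name, oid, item in rows:
    if cid not in cust_elems:
        cust = ET.SubElement(root, "customer", id=str(cid))
        ET.SubElement(cust, "name").text = name
        cust_elems[cid] = cust
    order = ET.SubElement(cust_elems[cid], "order", id=str(oid))
    order.text = item

doc = ET.tostring(root, encoding="unicode")  # document shipped to the target

# Map: the target undoes the join, splitting the received document back
# into its own (here, identical) relational fragments.
target_orders = [(int(o.get("id")), int(c.get("id")), o.text)
                 for c in ET.fromstring(doc)
                 for o in c.findall("order")]
print(target_orders)
```

When source and target store these fragments the same way, the join performed at publish time is pure overhead: the target immediately reverses it. This is the redundancy the proposed fragment negotiation is meant to eliminate.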
Quite often, some of these fragments are stored similarly at the source and at the target systems, in which case combining such fragments at the source is unnecessary because the target system will split them again into its internal structures. Furthermore, in publish&map, XML documents are built at the source and consumed at the target, imposing a strict processing distribution that does not exploit the capabilities of the underlying systems.

1.1. Motivating Example

We sketch a typical exchange scenario between a sales and ordering system in which data is stored in a relational