A Scalable XML P2P Query System

Giovanni Conforti, Giorgio Ghelli, Paolo Manghi, Carlo Sartiani
Dipartimento di Informatica - Università di Pisa
Largo B. Pontecorvo 3, Pisa, Italy
{confor,ghelli,manghi,sartiani}@di.unipi.it

ABSTRACT

This paper presents the architecture and the self-management algorithms of XPeer, a p2p XML database system. Unlike existing p2p systems, XPeer is capable of self-organizing its administrative layers, so as to adapt to changes in the network topology and in the workload. The architecture of XPeer is based on two innovative concepts: the presence of two distinct subsystems (which we call overlays), devoted, respectively, to the management of queries and of schema update requests; and the use of cloning in order to distribute load among the administrative peers. By exploiting these key features, the actual workload processing power of XPeer can scale linearly in the number of peers in the system.

1. INTRODUCTION

Peer-to-peer is a term used with different meanings. We use it for a distributed system where every node is both a client and a server, and where nodes can freely join and freely leave the system. This implies that, if some peers perform special administrative tasks, the system must be able to dynamically replace them whenever they just “go away”. The potential availability of many servers, and the tolerance to any sudden server disconnection, make this architecture a good foundation for systems that may be extremely robust and scalable.

The huge popularity of p2p systems is mainly due to the diffusion of some file-sharing and file-transfer protocols, which proved that such systems can actually be efficient and robust in the face of very high volatility. However, such systems are extremely limited in the kind of queries they can support.

Much of the research on the construction of real p2p databases is currently aimed at the p2p decentralization of data integration mediators (e.g., Piazza [8] and CoDB [6]).
These systems, usually based on the GLAV paradigm [7], enable each peer to reformulate queries according to a given set of mappings, with no need for a centralized mediator. Schema translation is crucial in many application fields. However, the need to set up the translation implies that some human administrative work is needed in order to join such systems.

Other systems, like XPeer [12], address application fields where schema integration is not an issue, such as communities where a well-known common schema is exploited, or situations where nobody is going to define a schema mapping anyway, hence any query has to be exploratory. In this context, dynamicity and scalability are the central concerns.

Our Contribution. This paper studies the working cost as well as the scalability properties of the XPeer p2p XML query system. XPeer supports a non-trivial query language, allowing the user to formulate queries in the FLWR core of XQuery [5], and its architecture can be used in any system supporting lookups on multiple keys or more complex queries.

The system is characterized by the presence of two distinct overlays (subsystems), one handling query requests, and the other managing update requests. These overlays communicate through periodic synchronization operations, and manage themselves with no need for human intervention.
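
As a rough illustration of how two loosely coupled overlays might interact, the following Python sketch keeps update requests and query requests in separate objects that only exchange state during an explicit synchronization step. It is not XPeer's actual protocol: the class names, the representation of a schema as a set of element names, and the `synchronize` call are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class UpdateOverlay:
    """Hypothetical stand-in for the overlay that receives schema update requests."""
    schemas: dict = field(default_factory=dict)  # peer id -> set of element names (assumed)
    version: int = 0

    def register(self, peer_id, schema):
        # Record the latest schema description published by a peer.
        self.schemas[peer_id] = set(schema)
        self.version += 1


@dataclass
class QueryOverlay:
    """Hypothetical stand-in for the overlay that answers query requests."""
    snapshot: dict = field(default_factory=dict)
    version: int = 0

    def synchronize(self, updates):
        # Periodic synchronization: copy the current state of the update overlay.
        self.snapshot = {p: set(s) for p, s in updates.schemas.items()}
        self.version = updates.version

    def peers_with(self, element):
        # Peers whose last synchronized schema mentions the given element.
        return sorted(p for p, s in self.snapshot.items() if element in s)


updates = UpdateOverlay()
queries = QueryOverlay()

updates.register("p1", {"book", "title"})
queries.synchronize(updates)

# p2 joins after the last synchronization, so queries do not see it yet.
updates.register("p2", {"book", "author"})
print(queries.peers_with("book"))  # ['p1']

queries.synchronize(updates)
print(queries.peers_with("book"))  # ['p1', 'p2']
```

In this sketch the query side works on a possibly stale snapshot between synchronizations, so update traffic never blocks query handling; that is the kind of loose coupling the example is meant to show.
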
We study the cost of XPeer operations by first defining a general model for p2p systems, and then by using the model to describe a family of systems incorporating more and more features of XPeer. In particular, we introduce two novel techniques for organizing p2p systems: the cloning of peers participating in an overlay, and the presence of two distinct and functionally different overlays. We show how the loose connection between the two overlays allows XPeer management protocols to scale up to any number of peers.

Paper Outline. The paper is structured as follows. Section 2 illustrates the model we use for studying the query and update processing cost. Section 3 describes the XPeer system and analyzes its working cost and scalability; we proceed step by step, enriching a basic system with new features until the complete XPeer system is formed. Section 4 discusses some related work. Section 5 concludes.

2. SYSTEM MODEL

We represent a p2p system as a set of interconnected peers P = {p1,...,pn}. Each peer pi manages a portion of an XML data instance, as well as a schematic description of the data. The schema language of choice is XML Schema, but it can be any other language, provided that it meets two requirements: schema selectivity and schema brevity. Selectivity refers to the main application of schemas in XPeer: XPeer