Gossip-based Service Coordination for Scalability and Resilience

Filipe Campos
Qimonda Portugal S.A. (Trainee/Internship, 2008)
fcampos@di.uminho.pt

José Pereira
Universidade do Minho
jop@di.uminho.pt

ABSTRACT

Many interesting emerging applications involve the coordination of a large number of service instances, for instance, as targets for dissemination or as sources in information gathering. These applications raise hard architectural, scalability, and resilience issues that are not suitably addressed by centralized or monolithic coordination solutions.

In this paper we propose a lightweight approach to service coordination aimed at such application scenarios. It is based on gossiping and is thus potentially fully decentralized, requiring each participant to be concerned with only a small number of peers. Although gossip-based protocols are simple and scalable, they have been shown to yield strong emergent resilience guarantees.

We illustrate the approach with WS–PushGossip, a proof-of-concept coordination protocol based upon the WS–Coordination framework. Besides presenting WS–PushGossip, we illustrate its usefulness with a sample application, and outline a middleware implementation based on Apache Axis2.

Categories and Subject Descriptors
C.2.4 [Distributed Systems]: Distributed applications; D.2.11 [Software Architectures]: Patterns

General Terms
Design, Performance, Reliability

Keywords
Web Services, Gossip

1. INTRODUCTION

As service-oriented computing matures and becomes widespread, there is an increasing demand for applications involving very large numbers of coordinated services.
For instance, in systems management it is often necessary to aggregate and then query information amassed from a large number of sources. More often, the goal is to disseminate information to a very large number of interested parties, as attested by the growing interest in notification services, as described in Section 2.1.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
MW4SOC ’08, December 1, 2008, Leuven, Belgium
Copyright 2008 ACM 978-1-60558-368-6/08/12 ...$5.00.

As an example, consider a trading floor scenario in which stock market information is disseminated to a number of trader workstations and automatic trading systems. Each node thus maintains a local copy of the list of stock values, with which a client application may interact.

This scenario has traditionally been addressed by monolithic applications and group communication protocols [25], but a service-oriented approach is increasingly attractive as stock markets and trading systems become more interconnected and interoperable. Anecdotal evidence for this is the use of the scenario to motivate multiple research efforts [23, 15, 14] and as sample code for popular middleware packages [1].

Stock trading systems, however, have very stringent resilience and scalability requirements that are hard to achieve even with existing monolithic implementations [25]. Specifically, it is very hard to achieve stable high throughput when the number of participants is very large, even if the network topology and conditions are stable.
Such stability is an essential guarantee for these systems, where high volumes of data are transferred with tight timeliness requirements. The same requirements exist, for instance, in automated production management systems as deployed in the semiconductor industry.

Furthermore, it has been pointed out that this is a fundamental limitation of reliable information dissemination based on feedback mechanisms [11]. The problem stems from messages being buffered at multiple locations until fully acknowledged by all destinations, in order to deal with node and network faults. A single slow receiver or, worse yet, multiple transient perturbations can thus delay acknowledgment and garbage collection, leading to degraded throughput.

The current state of the art is that stable high throughput can be achieved by using gossip-based, or epidemic, protocols [12]. As described in Section 2.2, such protocols are also highly resilient to network and process faults, while scaling to large numbers of participants and high message throughput. Gossip protocols are, for instance, a key technology within the implementation infrastructure of Amazon.com Web Services [28].

The goal of this paper is to leverage gossiping in service-oriented computing as a high-level structuring paradigm, thus inherently achieving scalability and resilience when coordinating large numbers of services.
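
To make the gossip idea concrete, the following is a minimal sketch of generic push-style gossip dissemination in Python. It is not the WS–PushGossip protocol and says nothing about its message format. The function name `push_gossip_rounds`, the parameters `fanout` and `max_rounds`, the synchronous round model, and the assumption that every informed node keeps forwarding in every round are all illustrative choices rather than details taken from this paper. The sketch only shows that each participant contacts a small, fixed number of randomly chosen peers per round, with no acknowledgments and no central coordinator.

```python
import random


def push_gossip_rounds(num_nodes, fanout, seed=0, max_rounds=100):
    """Simulate push gossip in synchronous rounds (illustrative model only).

    Returns a list whose i-th entry is the number of informed nodes
    after i rounds; entry 0 counts only the original source.
    """
    if not 1 <= fanout <= num_nodes - 1:
        raise ValueError("fanout must be between 1 and num_nodes - 1")

    rng = random.Random(seed)
    informed = {0}  # hypothetical choice: node 0 is the source
    history = [len(informed)]

    for _ in range(max_rounds):
        if len(informed) == num_nodes:
            break  # every node has received the message
        targets = set()
        for node in informed:
            # Pick `fanout` distinct peers uniformly at random, excluding
            # the sender itself: sample from num_nodes - 1 slots and skip
            # over the sender's own index.
            for slot in rng.sample(range(num_nodes - 1), fanout):
                targets.add(slot if slot < node else slot + 1)
        informed |= targets
        history.append(len(informed))

    return history


if __name__ == "__main__":
    for round_no, count in enumerate(push_gossip_rounds(1000, fanout=3)):
        print(f"round {round_no}: {count} informed")
```

Under these assumptions the informed set can grow by at most a factor of `fanout + 1` per round, so early growth is roughly geometric while each sender's per-round cost stays constant regardless of the total number of participants. Because no sender waits for feedback, a slow receiver does not hold up the others, which is the property contrasted above with feedback-based dissemination.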