Survivable optical grid dimensioning: anycast routing with server and network failure protection

Chris Develder, Jens Buysse, Ali Shaikh, Brigitte Jaumard, Marc De Leenheer, and Bart Dhoedt

Ghent University – IBBT, Dept. of Information Technology – IBCN, Ghent, Belgium
CSE Dpt, Concordia University, Montreal (Qc) Canada
Email: chris.develder@intec.ugent.be, bjaumard@ciise.concordia.ca

Abstract—Grids can efficiently deal with the challenging computational and data processing tasks that cutting-edge science generates today. So-called e-Science grids cope with these complex tasks by deploying geographically distributed server infrastructure, interconnected by high-speed networks. The latter benefit from optical technology, offering low latencies and high bandwidths, thus giving rise to so-called optical grids or lambda grids. In this paper, we address the dimensioning problem of such grids: how to decide how much server infrastructure to deploy and at which locations in a given topology, how much network capacity to provide, and which routes to follow. Compared to earlier work, we propose an integrated solution: we jointly optimize network and server capacity, and incorporate resiliency against both network and server failures. Assuming we are given the number of resource reservation requests arriving at each network node (where a resource reservation implies reserving both processing capacity at a server site and a network connection towards it), we solve the problem of first choosing a predetermined number of server locations to use, and subsequently determining the routes to follow while minimizing resource requirements. In a case study on a meshed European network comprising 28 nodes and 41 links, we show that compared to classical (i.e.
without relocation) shared path protection against link failures only, we can offer resilience against both single link and server failures by adding about 55% extra server capacity and 26% extra wavelengths.

I. INTRODUCTION

Grids were originally envisioned for so-called e-Science applications (stemming from various domains, such as astrophysics, climate modeling, and particle physics): heterogeneous resources (computational, storage and networking) are geographically spread (possibly over various administrative domains, implying that resource coordination is not subject to centralized control) to jointly provide the required computational and storage capabilities. Similar ideas are applied in cloud computing and virtualisation. These technologies make network dimensioning a complex problem, especially for providers needing to plan and deploy both network and IT resources (i.e., servers, both for computing and storage). In particular, since users typically no longer care where their workload is processed (“in the cloud”), freedom arises as to where to install, e.g., server farms. Thus, a (source, destination)-based traffic matrix, as assumed in traditional (optical) network dimensioning problems, including many routing and wavelength assignment (RWA) approaches, is not a priori available.

In the current work, we assume the network interconnecting the Grid server sites to be optical circuit-switched (such as an ASON), based on Wavelength Division Multiplexing (WDM). To deal with potential network failures, various network resilience strategies for WDM networks have been devised (for an extensive overview, see [1], [2]).
A well-known classical shared path protection scheme protects against single link failures: a primary path from source to destination is protected by a link-disjoint backup path, which is used in case of a failing link (this link diversity guarantees that the primary and backup paths never fail simultaneously under any single link failure). In a grid-like scenario however, we proposed the idea of exploiting relocation [3], which is possible due to the anycast routing principle. Since a user generally does not care about the exact location where their workload is processed, it could be better to relocate the job to another resource (different from the one chosen under failure-free conditions).

In this paper, we expand on the relocation idea to assess the resource requirements when also catering for server site failures. Without relocating jobs to other locations, providing resilience against server site failures would imply doubling server capacity (if no service degradation is allowed) at each location. However, by relocating to other sites (as in [4] for link failure protection) we may be able to reduce the amount of overall backup server capacity. In this paper, we assess this reduction quantitatively by dimensioning the network to survive both single link and server site failures. As a matter of fact, the dimensioning model presented further is generic and can address any failure scenario that can be expressed as a set of jointly failing link resources, i.e., a so-called shared risk link group (SRLG; e.g., to model failures such as fibre duct cuts [5]). Note that a node failure can be represented as the joint failure of its incident links (hence leading to an SRLG comprising all of them).

We here consider individual connections between a source site generating tasks to execute, and a server site processing them. Thus, we do not consider provisioning virtual networks interconnecting multiple sites. For an overview of such work, see [6].
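
As an illustration of the SRLG notion introduced above, the following Python sketch (not part of the dimensioning model itself) represents each failure scenario as a set of links and checks whether a primary/backup path pair survives every scenario. The toy topology, the node names (S, A, B, D1, D2) and all function names are hypothetical, and treating a server site failure as the failure of all links incident to its node is an assumption made only for this example.

```python
from typing import FrozenSet, Iterable, List, Set

Link = FrozenSet[str]  # undirected link, e.g. frozenset({"A", "B"})


def link(u: str, v: str) -> Link:
    """Undirected link between nodes u and v."""
    return frozenset((u, v))


def path_links(path: List[str]) -> Set[Link]:
    """Set of links traversed by a path given as a node sequence."""
    return {link(u, v) for u, v in zip(path, path[1:])}


def single_link_srlgs(links: Iterable[Link]) -> List[Set[Link]]:
    """One SRLG per link: models all single link failures."""
    return [{e} for e in links]


def node_srlg(node: str, links: Iterable[Link]) -> Set[Link]:
    """SRLG modelling the failure of `node`: all links incident to it."""
    return {e for e in links if node in e}


def survives(primary: List[str], backup: List[str],
             srlgs: Iterable[Set[Link]]) -> bool:
    """True if no SRLG hits both the primary and the backup path."""
    p, b = path_links(primary), path_links(backup)
    return not any(p & s and b & s for s in srlgs)


# Hypothetical toy topology: source S, intermediate nodes A and B,
# candidate server sites D1 and D2.
LINKS = [link("S", "A"), link("S", "B"), link("A", "D1"),
         link("B", "D1"), link("B", "D2")]

PRIMARY = ["S", "A", "D1"]
BACKUP_SAME_SITE = ["S", "B", "D1"]   # classical: same destination
BACKUP_RELOCATED = ["S", "B", "D2"]   # relocation: other server site

SINGLE_LINK = single_link_srlgs(LINKS)
WITH_SITE_D1 = SINGLE_LINK + [node_srlg("D1", LINKS)]

print(survives(PRIMARY, BACKUP_SAME_SITE, SINGLE_LINK))    # link failures only
print(survives(PRIMARY, BACKUP_SAME_SITE, WITH_SITE_D1))   # plus failure of D1
print(survives(PRIMARY, BACKUP_RELOCATED, WITH_SITE_D1))   # relocated backup
```

In this toy example, the link-disjoint backup towards the same site D1 survives every single link failure but not the failure of D1 itself, whereas the relocated backup towards D2 survives both kinds of failure.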

The remainder of this paper addresses the offline grid dimensioning problem, first stated in [7] for the case without resilience against failures. Whereas there we proposed a phased approach, we now (i) solve the sub-problems of establishing server and network capacity in an integrated way, and (ii) additionally provide resiliency against both network

978-1-61284-231-8/11/$26.00 ©2011 IEEE. This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the IEEE ICC 2011 proceedings.