Survivable optical grid dimensioning: anycast
routing with server and network failure protection
Chris Develder∗, Jens Buysse∗, Ali Shaikh†, Brigitte Jaumard†, Marc De Leenheer∗, and Bart Dhoedt∗
∗Ghent University – IBBT, Dept. of Information Technology – IBCN, Ghent, Belgium
†CSE Dpt, Concordia University, Montreal (Qc) Canada
Email: chris.develder@intec.ugent.be, bjaumard@ciise.concordia.ca
Abstract—Grids can efficiently deal with challenging compu-
tational and data processing tasks which cutting edge science
is generating today. So-called e-Science grids cope with these
complex tasks by deploying geographically distributed server
infrastructure, interconnected by high speed networks. The latter
benefit from optical technology, offering low latencies and high
bandwidths, thus giving rise to so-called optical grids or lambda
grids.
In this paper, we address the dimensioning problem of such
grids: deciding how much server infrastructure to deploy and at
which locations in a given topology, how much network capacity
to provide, and which routes to follow. Compared
to earlier work, we propose an integrated solution to these
questions, i.e., we jointly optimize network
and server capacity, and incorporate resiliency against both
network and server failures. Assuming we are given the number
of resource reservation requests arriving at each network node
(where a resource reservation implies reserving both processing
capacity at a server site and a network connection towards it),
we solve the problem of first choosing a predetermined number of
server locations to use, and subsequently determining the routes to
follow while minimizing resource requirements. In a case study
on a meshed European network comprising 28 nodes and 41
links, we show that compared to classical (i.e., without relocation)
shared path protection against link failures only, we can offer
resilience against both single link and server site failures by adding
about 55% extra server capacity and 26% extra wavelengths.
I. INTRODUCTION
Grids were originally envisioned for so-called e-Science applications
(stemming from various domains, such as astrophysics, climate modeling,
and particle physics): heterogeneous resources (computational, storage
and networking) are geographically spread (possibly over various
administrative domains, implying that resource coordination is not
subject to centralized control) and jointly provide the required
computational and storage capabilities. Similar ideas are applied
in cloud computing and virtualisation. These technologies
make network dimensioning a complex problem, especially
for providers needing to plan and deploy both network and
IT resources (i.e., servers, both for computing and storage). In
particular, since users typically no longer care where their
workload is processed (“in the cloud”), freedom arises as to
where to install e.g., server farms. Thus, a (source,destination)-
based traffic matrix, as assumed in traditional (optical) network
dimensioning problems, including many routing and wave-
length assignment (RWA) approaches, is not a priori available.
In the current work, we assume the network interconnecting
the Grid server sites to be optical circuit-switched (such as an
ASON), based on Wavelength Division Multiplexing (WDM).
To deal with potential network failures, various network re-
silience strategies for WDM networks have been devised (for
an extensive overview, see [1], [2]). A well-known classical
shared path protection scheme protects against single link
failures: a primary path from source to destination is protected
by a link-disjoint backup path which is used in case of a failing
link (this link diversity guarantees that the primary and backup
paths will never fail simultaneously for any single link failure).
In a grid-like scenario, however, we previously proposed
exploiting relocation [3], which the anycast routing principle
makes possible. Since a user generally does not care about
the exact location where the workload is processed, it
can be better, upon a failure, to relocate the job to another
resource (different from the one chosen under failure-free conditions).
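
The difference between classical protection and relocation can be illustrated with a minimal sketch. It is not the dimensioning model of this paper: it assumes an unweighted toy topology, hop-count shortest paths found by breadth-first search, and a simple two-step heuristic (primary first, then a link-disjoint backup), which can miss feasible path pairs in some topologies. All names (`shortest_path`, `protect`, the five-node topology) are hypothetical.

```python
from collections import deque


def shortest_path(adj, source, targets, banned_links=frozenset()):
    """Breadth-first search for a minimum-hop path from `source` to the
    nearest node in `targets`, never crossing a link in `banned_links`.
    Returns the path as a list of nodes, or None if no target is reachable."""
    targets = set(targets)
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node in targets:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj[node]:
            if nxt not in parent and frozenset((node, nxt)) not in banned_links:
                parent[nxt] = node
                queue.append(nxt)
    return None


def links_of(path):
    """Set of undirected links traversed by a node path."""
    return {frozenset(pair) for pair in zip(path, path[1:])}


def protect(adj, source, servers, relocation):
    """Pick a primary path to the nearest server (anycast), then a backup
    path that is link-disjoint from it. Without relocation the backup must
    end at the primary server; with relocation any server site is allowed."""
    primary = shortest_path(adj, source, servers)
    if primary is None:
        return None, None
    backup_targets = servers if relocation else [primary[-1]]
    backup = shortest_path(adj, source, backup_targets, links_of(primary))
    return primary, backup


# Hypothetical topology: ring A-B-C-E-D-A with server sites at C and E.
ADJ = {
    "A": ["B", "D"],
    "B": ["A", "C"],
    "C": ["B", "E"],
    "D": ["A", "E"],
    "E": ["D", "C"],
}
SERVERS = {"C", "E"}

classical = protect(ADJ, "A", SERVERS, relocation=False)
relocated = protect(ADJ, "A", SERVERS, relocation=True)
```

In this toy case both variants use the primary path A-B-C, but the classical backup must return to C (A-D-E-C, three hops), whereas relocation lets the backup terminate at E (A-D-E, two hops).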
In this paper, we expand on the relocation idea to assess
the resource requirements for also protecting against server site failures.
Without relocating jobs to other locations, providing
resilience against server site failures would imply doubling
server capacity (if no service degradation is allowed) at each
location. However, by relocating to other sites (as in [4] for
link failure protection) we may be able to reduce the amount
of overall backup server capacity. In this paper, we assess
this reduction quantitatively by dimensioning the network to
survive both single link and server site failures. As a matter
of fact, the dimensioning model presented later in this paper is generic
and can address any failure scenario that can be expressed as
a set of jointly failing link resources, i.e., a so-called shared
risk link group (SRLG; e.g., to model failures such as fibre
duct cuts [5]). Note that node failure can be represented as
joint failure of its incident links (hence leading to an SRLG
comprising all of them).
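
The SRLG view of failures can be sketched as follows. This is only an illustration under assumptions: links are undirected, each failure scenario is a plain set of links, and the helper names (`node_srlg`, `survives`, `failure_scenarios`) and the toy topology are hypothetical rather than taken from the model in this paper.

```python
def link(u, v):
    """Undirected link between nodes u and v."""
    return frozenset((u, v))


# Hypothetical five-node ring topology with server sites at C and E.
LINKS = {
    link("A", "B"), link("B", "C"), link("C", "E"),
    link("D", "E"), link("A", "D"),
}


def single_link_srlgs(links):
    """One SRLG per link: the classical single link failure scenarios."""
    return [frozenset([l]) for l in links]


def node_srlg(links, node):
    """SRLG representing the failure of `node`: all its incident links."""
    return frozenset(l for l in links if node in l)


def path_links(path):
    """Set of links traversed by a node path."""
    return {link(u, v) for u, v in zip(path, path[1:])}


def survives(path, srlg):
    """True if the path uses none of the links that fail together.
    Note: a zero-hop path (source collocated with the server) has no
    links, so this link-based check alone cannot flag its failure."""
    return path_links(path).isdisjoint(srlg)


def failure_scenarios(links, server_sites):
    """Single link failures plus one node-type SRLG per server site."""
    return single_link_srlgs(links) + [node_srlg(links, s) for s in server_sites]


srlg_c = node_srlg(LINKS, "C")
scenarios = failure_scenarios(LINKS, ["C", "E"])
```

With this representation, a failure of server site C takes down both B-C and C-E, so a primary path A-B-C and a backup path A-D-E-C fail together, while a relocated backup A-D-E survives.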
We here consider individual connections between a source
site generating tasks to execute, and a server site processing
them. Thus, we do not consider provisioning virtual networks
interconnecting multiple sites. For an overview of such work,
see [6].
The remainder of this paper addresses the offline grid
dimensioning problem, first stated in [7] for the case with-
out resilience against failures. Whereas there we proposed
a phased approach, we now (i) solve the sub-problems of
establishing server and network capacity in an integrated way,
and (ii) additionally provide resiliency against both network
978-1-61284-231-8/11/$26.00 ©2011 IEEE
This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the IEEE ICC 2011 proceedings