A Study of Fault-tolerance Characteristics of Data Center Networks
Yang Liu, Dong Lin, Jogesh Muppala, Mounir Hamdi
Department of Computer Science and Engineering
Hong Kong University of Science and Technology
HKSAR, China
{liuyangcse, ldcse, muppala, hamdi}@cse.ust.hk
Abstract—We present an evaluation of the fault-tolerance
characteristics of several important data center network topolo-
gies, including Fat-tree, DCell, HyperBCube and BCube using
several metrics, including average path length, aggregated
bottleneck throughput and connection failure ratio. These
enable us to present an objective comparison of the network
topologies under faulty conditions.
Keywords-Fault-Tolerant; Data Center Network; Evaluation
I. INTRODUCTION
Data center infrastructure design has recently been receiv-
ing significant attention both from academia and industry
due to the growing importance of data centers in supporting
and sustaining the rapidly growing Internet-based applica-
tions including search, video content hosting and distribu-
tion, social networking, and large-scale computations.
The architecture of the network interconnecting the
servers has a significant impact on the agility and recon-
figurability of the data center infrastructure to respond to
changing application demands and service requirements.
Today, data center networks primarily use top of rack (ToR)
switches that are interconnected through end of rack (EoR)
switches, which are in turn connected via core switches.
This architecture requires significant bandwidth towards the
network core. This prompted several researchers to suggest
alternate approaches for scalable cost-effective network ar-
chitectures, based on topologies such as Fat-tree [1] [2] [3],
Clos Network [4], DCell [5], FiConn [6], BCube [7] and
HyperBCube [8]. These topologies can be classified into two
groups: the tree-based topologies, such as Fat-tree and Clos
Network, have only one NIC on each server, and scale up by
adding more ports on the switches; the recursive topologies,
such as DCell, FiConn, BCube and HyperBCube, can have
multiple NICs on each server, and scale up by either adding
more ports to the servers or more ports to the switches.
Given the large number of servers and switches in a data center, the network must be designed to recover automatically from common failures and maintain most of its performance in the interim. At
the hardware level, most data center topologies employ
redundant devices to ensure connectivity in case of hardware
failures. Moreover, the routing algorithms running on top make use of the redundancy offered by the hardware to recover from failures.
The primary goal of this paper is to make a fair com-
parison of the fault-tolerance characteristics of four rep-
resentative topologies: Fat-tree, DCell, HyperBCube and
BCube. The Fat-tree is representative of a tree-based topol-
ogy, while the rest are recursive topologies with different
scalability properties. We use three metrics, namely aggregated bottleneck throughput, average path length and connection failure ratio, to evaluate all the above topologies.
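To make two of these metrics concrete, the sketch below (our own illustrative code, not the paper's evaluation tool) computes average path length and connection failure ratio over all server pairs of a small topology using breadth-first search; the node names and graph representation are hypothetical:

```python
from collections import deque
from itertools import combinations

def shortest_path_len(adj, src, dst):
    """BFS hop count between two nodes; None if disconnected."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None

def evaluate(adj, servers):
    """Average path length and connection failure ratio over server pairs."""
    lengths, failed, total = [], 0, 0
    for s, d in combinations(servers, 2):
        total += 1
        hops = shortest_path_len(adj, s, d)
        if hops is None:
            failed += 1
        else:
            lengths.append(hops)
    apl = sum(lengths) / len(lengths) if lengths else float('inf')
    return apl, failed / total

# Toy example: four servers attached to one switch.
adj = {'sw': ['s0', 's1', 's2', 's3']}
for s in adj['sw']:
    adj[s] = ['sw']
apl, cfr = evaluate(adj, ['s0', 's1', 's2', 's3'])
print(apl, cfr)  # every pair is 2 hops apart; no pair is disconnected
```

Simulating link or node failures amounts to deleting edges from `adj` before calling `evaluate`, which makes the connection failure ratio nonzero once some pair becomes disconnected.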
The rest of the paper is organized as follows. First, we give a brief introduction to the topologies. We then describe their basic properties and the metrics we use to evaluate them. Next, we present the fault-tolerance characteristics of the topologies, followed by a discussion of related work. Finally, we conclude the paper.
II. DATA CENTER NETWORK ARCHITECTURES
A. Fat-tree
[Figure: Core, Aggregation and Edge switch layers over four Pods]
Figure 1. A 3-level Fat-Tree Topology
The Fat-tree is an extended version of a tree topology [9]
based on a complete binary tree. Fat-trees have been used as
a topology for data centers by several researchers [1], [2].
Figure 1 shows a Fat-tree topology with n = 4.
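For concreteness, a Fat-tree built from n-port switches (as in [1]) has n pods, each with n/2 edge and n/2 aggregation switches, plus (n/2)^2 core switches, and supports n^3/4 servers. The helper below (an illustrative sketch with names of our own choosing) computes these counts:

```python
def fat_tree_counts(n):
    """Device counts for a Fat-tree built from n-port switches."""
    assert n % 2 == 0, "port count must be even"
    half = n // 2
    return {
        'core': half * half,          # (n/2)^2 core switches
        'aggregation': n * half,      # n pods, n/2 aggregation switches each
        'edge': n * half,             # n pods, n/2 edge switches each
        'servers': n * half * half,   # n/2 servers per edge switch
    }

counts = fat_tree_counts(4)  # the n = 4 topology of Figure 1
print(counts)
```

With n = 4 this yields 4 core switches, 8 aggregation switches, 8 edge switches and 16 servers, matching the small topology sketched in Figure 1.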
B. DCell
A DCell [5] is a recursively defined data center network topology. The most basic element of a DCell, which is called DCell_0, consists of n servers and one n-port switch. Each server in a DCell_0 is connected to the switch in the same DCell_0.
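The DCell_0 building block can be captured directly in code; the sketch below (illustrative, with hypothetical node naming) returns the server-to-switch links of one DCell_0:

```python
def build_dcell0(n, prefix=""):
    """Edge list of a DCell_0: n servers, each wired to one n-port switch."""
    switch = prefix + "sw"
    return [(prefix + "srv" + str(i), switch) for i in range(n)]

edges = build_dcell0(4)
print(edges)  # four server-to-switch links
```

Higher-level DCells are then built by interconnecting several such blocks through the spare server ports, which is what the recursive construction described next does.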
Let DCell_k be a level-k DCell. The first step is to construct a DCell_1 from several DCell_0s. Each DCell_1 has n + 1 DCell_0s, and each server of every DCell_0 in a DCell_1 is connected to a server in another DCell_0, respectively. As a
978-1-4673-2266-9/12/$31.00 ©2012 IEEE