A Study of Fault-tolerance Characteristics of Data Center Networks

Yang Liu, Dong Lin, Jogesh Muppala, Mounir Hamdi
Department of Computer Science and Engineering
Hong Kong University of Science and Technology, HKSAR, China
{liuyangcse, ldcse, muppala, hamdi}@cse.ust.hk

Abstract—We present an evaluation of the fault-tolerance characteristics of several important data center network topologies, including Fat-tree, DCell, HyperBCube and BCube, using several metrics: average path length, aggregated bottleneck throughput and connection failure ratio. These metrics enable us to present an objective comparison of the network topologies under faulty conditions.

Keywords—Fault-Tolerant; Data Center Network; Evaluation

I. INTRODUCTION

Data center infrastructure design has recently been receiving significant attention from both academia and industry, due to the growing importance of data centers in supporting and sustaining rapidly growing Internet-based applications, including search, video content hosting and distribution, social networking, and large-scale computations.

The architecture of the network interconnecting the servers has a significant impact on the agility and reconfigurability of the data center infrastructure in responding to changing application demands and service requirements. Today, data center networks primarily use top-of-rack (ToR) switches that are interconnected through end-of-rack (EoR) switches, which are in turn connected via core switches. This architecture requires significant bandwidth towards the network core, which has prompted several researchers to suggest alternative approaches to scalable, cost-effective network architectures based on topologies such as Fat-tree [1] [2] [3], Clos Network [4], DCell [5], FiConn [6], BCube [7] and HyperBCube [8].
These topologies can be classified into two groups. Tree-based topologies, such as Fat-tree and the Clos Network, have only one NIC on each server and scale up by adding more ports to the switches. Recursive topologies, such as DCell, FiConn, BCube and HyperBCube, can have multiple NICs on each server and scale up by adding more ports either to the servers or to the switches.

Given the large number of computers and switches it contains, a data center must be designed to recover automatically from common failures and to maintain most of its performance in the interim. At the hardware level, most data center topologies employ redundant devices to ensure connectivity in case of hardware failures. Moreover, the routing algorithms running above the hardware make use of the redundancy it offers to recover from failures.

The primary goal of this paper is to make a fair comparison of the fault-tolerance characteristics of four representative topologies: Fat-tree, DCell, HyperBCube and BCube. The Fat-tree is representative of the tree-based topologies, while the rest are recursive topologies with different scalability properties. We mainly use three metrics, namely aggregated bottleneck throughput, average path length and connection failure rate, to evaluate all of the above topologies.

The rest of the paper is organized as follows. First, we give a brief introduction to the topologies. Thereafter, we describe their basic properties and the metrics we use to evaluate them. We then present the fault-tolerance characteristics of the topologies, followed by a discussion of related work by other researchers. Finally, we conclude the paper.

II. DATA CENTER NETWORK ARCHITECTURES

A. Fat-tree

Figure 1. A 3-level Fat-Tree Topology (core, aggregation and edge layers across four pods; figure omitted)

The Fat-tree is an extended version of a tree topology [9] based on a complete binary tree.
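As a concrete illustration of how a Fat-tree scales with switch port count, the following sketch (not from the paper; it assumes the standard 3-level construction of Al-Fares et al. [1], which the paper cites) computes the device counts for a Fat-tree built entirely from identical n-port switches:

```python
# Device counts for a 3-level Fat-tree built from identical n-port
# switches (standard construction from [1], assumed here):
#   n pods; n/2 edge and n/2 aggregation switches per pod;
#   (n/2)^2 core switches; n/2 servers per edge switch.

def fat_tree_size(n):
    """Return (pods, edge, aggregation, core, servers) for port count n."""
    assert n % 2 == 0 and n > 0, "port count must be a positive even number"
    pods = n
    edge = n * (n // 2)            # n/2 edge switches in each of n pods
    aggregation = n * (n // 2)     # likewise for the aggregation layer
    core = (n // 2) ** 2
    servers = n ** 3 // 4          # n/2 servers per edge switch
    return pods, edge, aggregation, core, servers

print(fat_tree_size(4))   # the small n = 4 topology of Figure 1
```

For n = 4 this gives 4 pods, 8 edge, 8 aggregation and 4 core switches serving 16 hosts, matching the small topology of Figure 1; the same formulas show why commodity 48-port switches already support tens of thousands of servers.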
Fat-trees have been used as a data center topology by several researchers [1], [2]. Figure 1 shows a Fat-tree topology with n = 4.

B. DCell

A DCell [5] is a recursively defined data center network topology. The most basic element of a DCell, called DCell_0, consists of n servers and one n-port switch; each server in a DCell_0 is connected to the switch in the same DCell_0. Let DCell_k denote a level-k DCell. The first step is to construct a DCell_1 from several DCell_0s: each DCell_1 contains n + 1 DCell_0s, and each server of every DCell_0 in a DCell_1 is connected to a server in a different DCell_0. As a

978-1-4673-2266-9/12/$31.00 ©2012 IEEE
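The recursive construction above can be summarized by a server-count recurrence. The sketch below is an assumption-based helper, not code from the paper; it follows the construction in [5], where a DCell_k is assembled from t_{k-1} + 1 copies of DCell_{k-1} (t_{k-1} being the number of servers in a DCell_{k-1}), giving t_k = t_{k-1} × (t_{k-1} + 1):

```python
# Number of servers t_k in a level-k DCell built from n-port switches,
# assuming the construction in [5]:
#   t_0 = n (one n-port switch with n servers),
#   t_k = t_{k-1} * (t_{k-1} + 1), since a DCell_k is built from
#   t_{k-1} + 1 fully interconnected copies of DCell_{k-1}.

def dcell_servers(n, k):
    """Servers in a DCell_k whose DCell_0 uses one n-port switch."""
    t = n                  # level 0: n servers on one switch
    for _ in range(k):
        t = t * (t + 1)    # t_{k-1} + 1 copies of DCell_{k-1}
    return t

for k in range(3):
    print(k, dcell_servers(4, k))   # doubly exponential growth
```

With n = 4 this yields 4, 20 and 420 servers at levels 0, 1 and 2, illustrating the doubly exponential scaling that distinguishes DCell from the polynomial scaling of the Fat-tree.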