A Robustness Analysis of Datacenter Topologies

Rodrigo S. Couto, Miguel Elias M. Campista, and Luís Henrique M. K. Costa
Universidade Federal do Rio de Janeiro - PEE/COPPE/GTA - DEL/POLI
Email: {souza,miguel,luish}@gta.ufrj.br

Abstract—The network infrastructure plays an important role in datacenter applications. Datacenter network architectures are therefore designed with three main goals: bandwidth, latency, and robustness. This work focuses on the last goal and provides a comparative analysis of the topologies of prevalent datacenter architectures. These architectures use either a network based only on switches or a hybrid scheme of servers and switches to perform packet forwarding. We analyze failures of the main networking elements (links, servers, and switches) to evaluate the tradeoffs of the different datacenter topologies. Considering only the network topology, our analysis provides a baseline study for the choice or design of a datacenter network with regard to robustness. Our results show that, as the number of failures increases, the considered hybrid topologies can substantially increase the path length, whereas servers in the switch-only topology tend to disconnect more quickly from the main network.

I. INTRODUCTION

Currently, the time needed to complete an Internet transaction is becoming a competitive factor among companies offering online services, such as web search, home banking, and shopping. The typical solution to reduce the response time of these services is distributed processing (e.g., MapReduce [1]). This strategy is more efficient if more servers in the datacenter execute the parts of a single task. As a consequence, the number of servers in datacenters is growing steadily. Google, for instance, has a computing infrastructure of almost 1 million servers spread across datacenters around the world [2]. Distributed processing often incurs bulk data transfers between servers.
Nevertheless, communication between servers adds latency to the completion of a distributed task. Moreover, high link utilization may lead to buffer congestion in switches, further increasing latency. As large data migrations can slow down datacenter operations, distributed programming models use locality properties to choose the most appropriate server to store data. Ideally, data transfers would be limited to servers within a single rack. However, choosing the best server to store a specific piece of data is a difficult task, especially considering the ever-increasing number of servers in datacenter networks. Thus, significant effort has been devoted to the development of new datacenter architectures that improve networking performance while keeping the economic aspect in mind. One of the earliest architectures for datacenter networking is Fat-Tree [3], which mainly focuses on the utilization of off-the-shelf switches to avoid high costs. BCube [4] and DCell [5] are examples of architectures that use a combination of servers and switches to perform packet forwarding. Server-based forwarding allows these architectures to use switches with lower port density than Fat-Tree. Each architecture uses a specific topology and routing protocol.

For datacenters, networking performance is a function of three main metrics: bandwidth, latency, and robustness. Despite the high available bandwidth achieved by these architectures, datacenters are composed of tens of thousands of servers, which are prone to failures, as are the networking elements [6]. Nonetheless, the datacenter must remain operational and present minimal impact to the user experience. To date, few studies compare existing architectures considering failures of each of the main networking elements, namely servers, switches, and physical links. Popa et al. [7] compare the different architectures in terms of cost and energy consumption, considering similar configurations to yield comparable performance. By analyzing the network capacity and maximum latency, they conclude that hybrid topologies (e.g., BCube) are cheaper than switch-only topologies (e.g., Fat-Tree). However, they foresee that switch-only topologies will become more cost-effective with the appearance of very low-cost switches in the near future. Guo et al. [4] address the robustness of the different topologies for specific traffic patterns and protocols, concluding that BCube is the most robust one.

In this work, we analyze the network topologies of three of the main existing datacenter architectures (Fat-Tree, BCube, and DCell) in terms of robustness, adding to the cost and bandwidth comparisons found in the literature. Our robustness analysis does not depend on the applications, routing algorithms, or traffic engineering strategies used by each architecture. Instead, it provides a baseline by using metrics to quantify the robustness of the datacenter network. These metrics can be combined with cost and available bandwidth metrics to help the datacenter designer. For example, the framework proposed by Curtis et al. [8] builds a datacenter topology that optimizes metrics such as available bandwidth and latency; it could be improved by incorporating the robustness metrics evaluated in this work. In our analysis, we model the datacenter topology as a graph, with servers and switches as nodes and network links connecting them. Using this model, we evaluate the impact of the failure of each networking component (server, switch, and link) on the entire network. Our results show that, for all considered topologies, the network degrades through the disconnection of relatively small sets of servers as the number of failures increases.
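The graph-based failure evaluation described above can be sketched in a few lines. The snippet below is only an illustrative toy (a tiny two-level switch tree, not the paper's actual Fat-Tree, BCube, or DCell generators) and assumes the networkx library; node names such as "core" and "srv00" are hypothetical.

```python
# Toy sketch of the failure analysis: model the network as a graph,
# remove a component, and count servers still attached to the main network.
# Assumes networkx; the topology here is illustrative, not from the paper.
import networkx as nx

def build_toy_topology():
    """Build a small switch-only tree: one core switch, two edge
    switches, and two servers attached to each edge switch."""
    g = nx.Graph()
    g.add_edges_from([("core", "edge0"), ("core", "edge1")])
    for e in range(2):
        for s in range(2):
            g.add_edge(f"edge{e}", f"srv{e}{s}")
    return g

def servers_reachable(g, servers):
    """Count servers in the connected component containing the core,
    i.e., servers still connected to the main network."""
    if "core" not in g:
        return 0
    component = nx.node_connected_component(g, "core")
    return len(component & set(servers))

servers = {"srv00", "srv01", "srv10", "srv11"}
g = build_toy_topology()
print(servers_reachable(g, servers))  # 4: all servers reachable

g.remove_node("edge0")                # fail one edge switch
print(servers_reachable(g, servers))  # 2: srv00 and srv01 disconnected
```

Repeating this for random link, server, and switch removals, and also tracking shortest-path lengths between the surviving servers, yields the kind of robustness curves the analysis is based on.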
We also show that hybrid topologies such as BCube and DCell can substantially increase the average path length as the number of failures increases, whereas in Fat-Tree servers tend to