Computer Networks 133 (2018) 195–211

Contents lists available at ScienceDirect
Computer Networks
Journal homepage: www.elsevier.com/locate/comnet

On reliability improvement of Software-Defined Networks

Shadi Moazzeni (a), Mohammad Reza Khayyambashi (a,*), Naser Movahhedinia (a), Franco Callegati (b)

(a) Department of Computer Architecture, Faculty of Computer Engineering, University of Isfahan, Isfahan, Iran
(b) Department of Computer Science and Engineering, University of Bologna, via Venezia 52, Cesena, FC 47521, Italy

Article info

Article history: Received 26 March 2017; Revised 6 November 2017; Accepted 17 January 2018

Keywords: Software-Defined Networks; Distributed controllers; Reliability; Failure detection; Fast failure recovery; Coordinator controller

Abstract

In Software-Defined Networks (SDNs) the role of the centralized controller is crucial, and thus it becomes a single point of failure. In this work, a distributed controller architecture is explored as a possible solution to improve fault tolerance. A network partitioning strategy, with small subnetworks, each with its own Master controller, is combined with the use of Slave controllers for recovery purposes. A novel formula is proposed to calculate the reliability rate of each subnetwork, based on the load and considering the number and degree of the nodes as well as the loss rate of the links. The reliability rates are shared among the controllers through a newly designed East/West bound interface, to select the coordinator for the whole network. The proposed method is called “Reliable Distributed SDN (RDSDN).” In RDSDN, the failure of a controller is detected by the coordinator, which may then trigger a fast recovery scheme to replace it. The numerical results show the performance improvement achievable with the adoption of RDSDN and indicate that this approach performs better in failure recovery than the methods used in related research.

© 2018 Elsevier B.V. All rights reserved.

1.
Introduction and motivation

Software-Defined Networking (SDN) has recently emerged as a novel paradigm to overcome the challenges related to the control plane of modern communication networks [1,2]. The brain of the control plane is the so-called SDN controller, which typically talks with network devices through a Southbound Interface (SBI) such as the OpenFlow protocol [3]. The control plane exposes some features and APIs to network operators through the Northbound Interface (NBI), so that they can design various management applications using, for instance, a set of REST APIs [4,5]. The centralized control plane approach of SDN promises controllable networks but raises a reliability issue, since the SDN controller may turn into a single point of failure. This is a known issue, and several countermeasures have been proposed. We review these works in Section 2.

In this article the goal is to consider data plane and control plane reliability as a combined issue, proposing a solution that combines network partitioning, controllers’ coordination, and data plane reliability characteristics to enhance the overall network resilience.

* Corresponding author.
E-mail addresses: moazzeni@eng.ui.ac.ir (S. Moazzeni), m.r.khayyambashi@comp.ui.ac.ir (M.R. Khayyambashi), naserm@eng.ui.ac.ir (N. Movahhedinia), franco.callegati@unibo.it (F. Callegati).
URL: http://eng.ui.ac.ir/~m.r.khayyambashi (M.R. Khayyambashi)

To reduce the effect of data plane or controller failures, it is assumed that a whole network domain can be partitioned into subnetworks. Each subnetwork is controlled by a Master controller and has one or more controllers of the other subnetworks as Slave controllers. Each subnetwork’s Master controller calculates the reliability rate of its subnetwork using the newly proposed formula. The reliability rates are shared periodically among controllers, using edge switches, through a newly designed East/West bound interface.
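The Master/Slave partitioning and the periodic exchange of reliability rates described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Subnetwork`, `RateCache`, `validate_partition`, and `best_controller` are hypothetical, the reliability rate is treated as an opaque number supplied by each Master (the formula itself is given later in the paper), and the sketch assumes that a higher rate is better and that ties are broken by controller identifier.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Subnetwork:
    """One partition of the network domain (hypothetical model)."""
    name: str
    master: str                                       # id of the Master controller
    slaves: List[str] = field(default_factory=list)   # Masters of other subnetworks


def validate_partition(subnets: List[Subnetwork]) -> None:
    """Check that each subnetwork has one or more Slave controllers and that
    every Slave is the Master of a different subnetwork."""
    master_of = {s.master: s.name for s in subnets}
    for s in subnets:
        if not s.slaves:
            raise ValueError(f"subnetwork {s.name} has no Slave controller")
        for slave in s.slaves:
            if master_of.get(slave) in (None, s.name):
                raise ValueError(
                    f"Slave {slave} of {s.name} must be the Master of another subnetwork")


@dataclass
class RateCache:
    """Reliability rates a controller has received from its peers."""
    rates: Dict[str, float] = field(default_factory=dict)

    def on_rate_update(self, controller_id: str, rate: float) -> None:
        # Called whenever a periodic rate message arrives from a peer.
        self.rates[controller_id] = rate

    def best_controller(self) -> str:
        # Assumption: the highest rate wins; ties go to the lowest id,
        # because max() returns the first maximal element of the sorted ids.
        if not self.rates:
            raise ValueError("no reliability rates received yet")
        return max(sorted(self.rates), key=self.rates.__getitem__)
```

Keeping the latest rate of every peer in a local cache means a controller can rank its peers without a further message exchange, which is the property a coordinator selection step would rely on.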
There may be backup control routes in addition to the main routes to improve fault coverage. The controller with the best reliability rate is selected as the coordinator, which periodically checks the status of the other controllers. This newly proposed method is called “Reliable Distributed SDN (RDSDN)” and aims to improve the reliability of SDNs with distributed controllers. In the detection phase, the coordinator detects any non-active controller, decides which other controller is most appropriate to take over the affected subnetwork according to the cached reliability rates, and then triggers the fast recovery scheme, which stays in effect until the failed controller is repaired. Therefore, the created inertia is attenuated. If the coordinator crashes or a better controller exists, a new coordinator is chosen by election.

https://doi.org/10.1016/j.comnet.2018.01.023
1389-1286/© 2018 Elsevier B.V. All rights reserved.

The paper is organized as follows: A review of the most important issues in SDN reliability and of the related studies is presented in Section 2. The main contribution, containing the state-of-the-art method for calculating the reliability rate and describing RDSDN, is in Section 3. The pilot implementation of our work, including