Computer Networks 133 (2018) 195–211
Computer Networks
journal homepage: www.elsevier.com/locate/comnet
On reliability improvement of Software-Defined Networks
Shadi Moazzeni a, Mohammad Reza Khayyambashi a,*, Naser Movahhedinia a, Franco Callegati b
a Department of Computer Architecture, Faculty of Computer Engineering, University of Isfahan, Isfahan, Iran
b Department of Computer Science and Engineering, University of Bologna, via Venezia 52, Cesena, FC 47521, Italy
Article info
Article history:
Received 26 March 2017
Revised 6 November 2017
Accepted 17 January 2018
Keywords:
Software-Defined Networks
Distributed controllers
Reliability
Failure detection
Fast failure recovery
Coordinator controller
Abstract
In Software-Defined Networks (SDNs), the centralized controller plays a crucial role and therefore becomes a single point of failure. In this work, a distributed controller architecture is explored as a possible solution to improve fault tolerance. A network partitioning strategy, with small subnetworks, each with its own Master controller, is combined with the use of Slave controllers for recovery purposes. A novel formula is proposed to calculate the reliability rate of each subnetwork, based on the load and considering the number and degree of the nodes as well as the loss rate of the links. The reliability rates are shared among the controllers through a newly designed East/West bound interface in order to select the coordinator for the whole network. The proposed method is called “Reliable Distributed SDN (RDSDN).” In RDSDN, the failure of controllers is detected by the coordinator, which may undertake a fast recovery scheme to replace them. The numerical results show the performance improvement achievable by adopting RDSDN and indicate that this approach performs better in failure recovery than the methods used in related research.
© 2018 Elsevier B.V. All rights reserved.
1. Introduction and motivation
Software-Defined Networking (SDN) has recently emerged as a novel paradigm to overcome the challenges related to the control plane of modern communication networks [1,2]. The brain of the control plane is the so-called SDN controller, which typically communicates with network devices through a Southbound Interface (SBI) such as the OpenFlow protocol [3]. The control plane exposes features and APIs through the Northbound Interface (NBI), which network operators can use to design various management applications, for instance by exploiting a set of REST APIs [4,5]. The centralized control plane approach of SDN promises controllable networks but raises a reliability issue, since the SDN controller may become a single point of failure. This is a known issue, and several countermeasures have been proposed. These works are reviewed in Section 2.
In this article, the goal is to treat data plane and control plane reliability as a combined issue, proposing a solution that combines network partitioning, controller coordination, and data plane reliability characteristics to enhance overall network resilience.
* Corresponding author.
E-mail addresses: moazzeni@eng.ui.ac.ir (S. Moazzeni), m.r.khayyambashi@comp.ui.ac.ir (M.R. Khayyambashi), naserm@eng.ui.ac.ir (N. Movahhedinia), franco.callegati@unibo.it (F. Callegati).
URL: http://eng.ui.ac.ir/~m.r.khayyambashi (M.R. Khayyambashi)
To reduce the effect of data plane or controller failures, it is assumed that a whole network domain can be partitioned into subnetworks. Each subnetwork is controlled by a Master controller and has one or more controllers of the other subnetworks as Slave controllers. The Master controller of each subnetwork calculates its reliability rate using the newly proposed formula. The reliability rates are shared periodically among the controllers, using edge switches, through a newly designed East/West bound interface. There may be backup control routes in addition to the main routes to improve fault coverage. The controller with the best reliability rate is selected as the coordinator, which periodically checks the status of the other controllers. This newly proposed method is called “Reliable Distributed SDN (RDSDN)” and aims to improve the reliability of SDNs with distributed controllers.
During the detection phase, the coordinator detects any inactive controller, decides which other controller is most appropriate to take over the subnetwork according to the cached reliability rates, and then triggers the fast recovery scheme, which remains in effect until the failed controller is repaired. Therefore, the created inertia is attenuated. If the coordinator crashes or a better controller exists, a new coordinator is chosen by election.
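The coordinator election and takeover decision described above can be illustrated with a minimal sketch. It is not the RDSDN implementation: the reliability rate is treated as an opaque number (the formula itself is not reproduced here), a higher rate is assumed to be better, the takeover candidates are assumed to be the failed Master's Slave controllers, and all names (Controller, elect_coordinator, pick_takeover, detection_round) as well as the sample rates are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    name: str
    reliability_rate: float   # opaque value; assumed "higher is better"
    alive: bool = True
    slaves: list = field(default_factory=list)  # names of this subnetwork's Slave controllers


def elect_coordinator(controllers):
    """Return the live controller with the best (assumed: highest) reliability rate."""
    live = [c for c in controllers.values() if c.alive]
    if not live:
        return None
    # Ties are broken by name so that every controller reaches the same result.
    return max(live, key=lambda c: (c.reliability_rate, c.name))


def pick_takeover(failed, controllers):
    """Among the failed Master's live Slaves, return the one with the best cached rate."""
    candidates = [controllers[n] for n in failed.slaves
                  if n in controllers and controllers[n].alive]
    if not candidates:
        return None
    return max(candidates, key=lambda c: (c.reliability_rate, c.name))


def detection_round(controllers, coordinator):
    """One periodic check: map each failed controller's name to its chosen replacement."""
    plan = {}
    for c in controllers.values():
        if c is coordinator or c.alive:
            continue
        replacement = pick_takeover(c, controllers)
        plan[c.name] = replacement.name if replacement else None
    return plan


# Hypothetical three-subnetwork domain with made-up reliability rates.
controllers = {
    "c1": Controller("c1", 0.91, slaves=["c2", "c3"]),
    "c2": Controller("c2", 0.97, slaves=["c1", "c3"]),
    "c3": Controller("c3", 0.88, slaves=["c1", "c2"]),
}
coordinator = elect_coordinator(controllers)
controllers["c3"].alive = False          # simulate a controller failure
recovery_plan = detection_round(controllers, coordinator)
```

In this sketch a coordinator crash is handled simply by calling elect_coordinator again over the controllers that are still alive, which mirrors the re-election step only at a high level.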
The paper is organized as follows: A review of the most important issues in SDN reliability and of the related studies is presented in Section 2. The main contribution, comprising the state-of-the-art method for calculating the reliability rate and the description of RDSDN, is in Section 3. The pilot implementation of our work, including
https://doi.org/10.1016/j.comnet.2018.01.023
1389-1286/© 2018 Elsevier B.V. All rights reserved.