Assessing Risk for Network Resilience

Marcus Schöller
NEC Europe Ltd., Heidelberg, Germany
Email: Marcus.Schoeller@neclab.eu

Paul Smith and David Hutchison
Lancaster University, Lancaster, UK
Email: {p.smith, dh}@comp.lancs.ac.uk

Abstract: Communication networks and the Internet, in particular, have become a critical infrastructure for daily life, business and governance. Various challenging conditions can render networks or parts thereof unusable, with severe consequences. Protecting a network from all possible challenges is infeasible because of monetary, hardware and software constraints. Hence, a methodology to measure the risk imposed by the various challenges threatening the system is a necessity. In this paper, we present a risk assessment process to identify the challenges with the highest potential impact to a network and its users. The result of this process is a prioritised list of challenges and associated system faults, which can guide network engineers towards the mechanisms that have to be built into the network to ensure network resilience, whilst meeting cost constraints. Furthermore, we discuss how outcomes from the intermediate steps of our risk assessment process can be used to inform network resilience design. A better understanding of these aspects and a way to determine reliable measures are open issues, and represent important new research items in the context of resilient and survivable networks.

I. INTRODUCTION

Multi-service networks are essential for business, social life and entertainment. Sensor-actor networks provide the basis for a huge variety of deployments, e.g., industrial production, home automation, alarm systems, and environmental monitoring to name a few. Control networks build the backbone for remote operation of facilities and transportation. In short, communication networks have become critical for our business, social life, and governmental operation.
This implies that failures of such networks, or parts of them, can have severe consequences.

Our work is based on the understanding that we can neither build fault-free networks nor forecast all possible challenges to a network deployment. Multiple strategies have been proposed for designing networks that survive different types of challenges. Examples include contributions from the ANSI/ATIS T1A1.2 Working group [1], ANSA [2], and Sterbenz et al., as part of the ResiliNets initiative [3]. Central to all of them is the architectural view that challenges, as external events, trigger dormant faults in the services of network systems, which manifest as errors. If these errors cannot be contained within the challenged service, they lead to a failure: a deviation of the delivered service outside of acceptable bounds. The acceptable service bounds are defined in terms of dependability, security, and quality of service (QoS). The variety of challenges that can degrade the delivered service is large: hardware destruction, challenges related to the communication environment, human mistakes, cyber attacks, unusual but legitimate requests for service, and failure of a service provider.

Based on this understanding, two components of a resilient network design can be derived: first, preventing challenges from affecting the system at all, and second, isolating erroneous behaviour within a service instance by building containment mechanisms. However, it is often not clear what the suitable set of prevention and isolation mechanisms is for a given network context. This is a problem given the potentially limited resources set aside for ensuring network resilience. In this paper, we propose a risk assessment process for network resilience that aims to identify the challenge-fault pairs that are likely to have the highest impact on a network stakeholder’s assets.
This information can be used to make informed decisions about the nature and configuration of protection and isolation mechanisms, and to increase overall network resilience within cost constraints.

The process we outline is similar to those proposed in the information security domain, but includes novel aspects that are necessary to address network resilience. Unlike in information security, losing an asset is not binary in network resilience. Service disruptions, i.e., loss of connectivity or reduced bandwidth, are often acceptable within defined bounds. Based on previous work, we show how to associate costs of compromise with various states of network service.

We highlight how information generated as part of our risk assessment process, in addition to the high-impact challenge-fault pairs, can be used by network engineers to ensure resilience. For example, information about the priority of assets can guide remedy selection in resource-constrained environments, such as sensor networks, or during a resource starvation attack. We suggest that service dependency graphs can be used as the basis for challenge-independent remedies, which can mitigate unforeseen challenges.

The rest of the paper is structured as follows: we describe the state of the art in risk management processes in Section II. Afterwards, in Section III, we introduce our risk assessment strategy for resilience. To demonstrate the applicability of our approach, we apply it to determine the high-impact challenges in a community wireless mesh network. There are a number of open issues for future investigation, including determining appropriate values, e.g., probabilities of challenge occurrence, to be used at various points in the process; we describe these in Section IV.

978-963-8111-77-7
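To make the idea of a prioritised list of challenge-fault pairs more concrete, the following is a minimal Python sketch and not the process defined in this paper. It assumes that the exposure of a pair can be approximated as the probability of the challenge triggering the fault multiplied by a cost of compromise, and that this cost grows linearly as the delivered service degrades, reflecting the observation that asset loss is graded and not binary. The names `ChallengeFaultPair`, `cost_of_compromise`, `prioritise` and `EXAMPLE_PAIRS`, the linear cost model, and all numeric values are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChallengeFaultPair:
    challenge: str        # external event, e.g. a cyber attack
    fault: str            # dormant fault the challenge may trigger
    probability: float    # assumed likelihood of the challenge triggering the fault
    service_level: float  # assumed fraction of service still delivered (0.0 to 1.0)


def cost_of_compromise(service_level: float, full_outage_cost: float) -> float:
    """Assumed linear cost model: the less service remains, the higher the cost."""
    if not 0.0 <= service_level <= 1.0:
        raise ValueError("service_level must lie in [0, 1]")
    return (1.0 - service_level) * full_outage_cost


def prioritise(pairs, full_outage_cost: float):
    """Return the pairs sorted by assumed exposure, highest first."""
    def exposure(pair: ChallengeFaultPair) -> float:
        return pair.probability * cost_of_compromise(
            pair.service_level, full_outage_cost
        )
    return sorted(pairs, key=exposure, reverse=True)


# Hypothetical input data; the values are invented for illustration.
EXAMPLE_PAIRS = [
    ChallengeFaultPair("hardware destruction", "single gateway node", 0.05, 0.0),
    ChallengeFaultPair("resource starvation attack", "unbounded request queue", 0.30, 0.5),
    ChallengeFaultPair("human mistake", "unvalidated configuration change", 0.20, 0.8),
]

if __name__ == "__main__":
    for pair in prioritise(EXAMPLE_PAIRS, full_outage_cost=1000.0):
        print(pair.challenge, "/", pair.fault)
```

In this sketch a frequent challenge that only halves the delivered service can outrank a rare challenge that causes a full outage, which is the kind of trade-off a graded cost model is meant to expose.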