A Fault Resilient Architecture for Distributed Cyber-Physical Systems

Fardin Abdi Taghi Abad, Marco Caccamo
Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana-Champaign, USA
{abditag2, mcaccamo}@ILLINOIS.EDU

Brett Robbins
Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign
Urbana-Champaign, USA
robbins3@ILLINOIS.EDU

Abstract-In this paper we discuss a general approach and architecture for the design of distributed cyber-physical systems that makes them resilient to communication faults. In this approach, each node exploits the physical connections between nodes to estimate some of the state parameters of remote nodes, both to detect faults and to maintain the stability of the system after a fault occurs. Finally, based on this architecture and approach, a fault-resilient decentralized voltage control algorithm is presented and evaluated.

Keywords-Distributed cyber-physical systems; Power network; Unreliable communication; Fault resilient

I. INTRODUCTION

Cyber-physical systems are a class of systems in which physical systems are tightly combined with computational elements and feature a high level of coordination with each other. In other words, any system in which a physical device is controlled by a computer and/or is connected to a network of other computer-controlled physical devices can be classified in this category. Power grids, water distribution networks, transportation systems, and heart pacemakers embedded in the body are all examples of such systems.

Distributed cyber-physical systems, an important subgroup of cyber-physical systems, can be defined as a set of interconnected computer-controlled physical plants that physically affect each other.
That is, the output of each node is a function not only of its own control inputs and state variables but also of the control inputs of other nodes, depending on the connection graph of the system. Examples of this type of system include power grids, water and wastewater distribution systems, and traffic control systems. Because these systems are highly critical, maintaining safety and stability is the first priority for every system architect. Any accidental or malicious fault in any of their components can lead to huge costs and irreparable damage.

The electric grid is a well-known distributed cyber-physical system with two general control architectures: centralized and decentralized. In centralized methods, all the nodes of the system are connected through communication channels to a central supervisory control and data acquisition center. In decentralized algorithms, each node communicates with only a subset of the nodes, without any central controller.

As mentioned in [1] and [2], the communication channels connecting controllers to each other play an essential role in maintaining the functionality and stability of the system: they enable the distributed controllers to coordinate with other nodes and take control decisions that result in the desired state for the entire system. Based on an examination of 162 disturbances reported by the North American Electric Reliability Council (NERC), the authors of [3] indicate that "information system failures contribute to a very high percentage of large failures". Thus, fault tolerance, information security, and robustness of system communication design and implementation are critical to cyber-physical system control [4].

Work has been done to provide more robust and secure communication for power grids [5]. In [6], ubiquitous TCP/IP protocols were deployed to provide reliable data delivery for a cyber-physical system.
These protocols have unpredictable latency, which can cause major problems for time-critical control decisions [7], [8]. While most of these efforts in communication networks have focused on preventing faults, little work addresses system control methods that operate under malicious or accidental faults [9]. If a data stream is disconnected or packets arrive late, controllers must make decisions based only on their local state, unaware of the rest of the system. This greatly increases the chance of driving the system into an unsafe state.

Another issue with the power grid is that the technology currently deployed in its communication infrastructure dates back a few decades, before many of the current advances in distributed computing had been made. Because of underinvestment and the heavy cost of transitioning to new communication solutions, new technologies are unlikely to be deployed in this infrastructure in the near future [10], [11]. Thus, an approach is needed to overcome communication faults without changing the current infrastructure. In this paper, we exploit some unique features of cyber-physical systems and propose an architecture for this purpose.

Most previous work has not considered the dynamics of physical systems and how they can be used to detect compromised nodes or a fault in a com-