Software-Based Adaptive and Concurrent Self-Testing in Programmable Network Interfaces

Yizheng Zhou, Vijay Lakamraju, Israel Koren, C.M. Krishna
Department of Electrical and Computer Engineering
University of Massachusetts, Amherst, MA 01003
E-mail: {yzhou, vlakamra, koren, krishna}@ecs.umass.edu

Abstract

Emerging network technologies have complex network interfaces that have renewed concerns about network reliability. In this paper, we present an effective low-overhead failure detection technique, which is based on a software watchdog timer that detects network processor hangs and a self-testing scheme that detects interface failures other than processor hangs. The proposed adaptive and concurrent self-testing scheme achieves failure detection by periodically directing the control flow to go through only active software modules in order to detect errors that affect instructions in the local memory of the network interface. The paper shows how this technique can be made to minimize the performance impact on the host system and be completely transparent to the user.

1. Introduction

Interfaces with a network processor and large local memory are widely used [14, 16, 17, 18, 19, 20, 21]. The complexity of network interfaces has increased tremendously over the past few years. This is evident from the amount of silicon used in the core of network interface hardware. A typical dual-speed Ethernet controller uses around 10K gates, whereas a more complex high-speed network processor such as the Intel IXP1200 [22] uses over 5 million transistors. This trend is being driven by the demand for greater network performance, and so communication-related processing is increasingly being offloaded to the network interface.
As transistor counts increase dramatically, single-bit upsets from transient faults, which arise from energetic particles such as neutrons from cosmic rays and alpha particles from packaging material, have become a major reliability concern [1, 2], especially in harsh environments [3, 4]. In most cases, the failure is soft, i.e., it does not reflect a permanent failure of the device, and typically a reset of the device or a rewriting of the memory cell returns the device to normal functioning. As we will see in the following sections, soft errors can cause the network interface to stop responding completely, to function improperly, or even to crash or hang the host computer. Quickly detecting and recovering from such network interface failures is therefore crucial for a system requiring high reliability. We need to provide fault tolerance not only for the hardware in the network interface, but also for the local memory of the network interface, where the network control program (NCP) resides.

In this paper, we present an efficient software-based failure detection technique for programmable network interfaces. Software-based fault tolerance approaches are attractive, since they allow the implementation of dependable systems without incurring the high costs of using custom hardware or massive hardware redundancy. On the other hand, software fault tolerance approaches impose overhead in terms of reduced performance and increased code size. Since performance is critical for high-speed network interfaces, fault tolerance techniques applied to them must have a minimal performance impact.

* This work has been supported in part by a grant from a joint NSF and NASA program on Highly Dependable Computing (NSF grant CCR-0234363, NASA grant NNA04C158A).
Our failure detection scheme is based on a software-implemented watchdog timer to detect network processor hangs, and a software-implemented adaptive and concurrent self-testing technique to detect non-interface-hang failures, such as data corruption and bandwidth reduction. The proposed scheme achieves failure detection by periodically directing the control flow to go through program paths in specific portions of the NCP in order to detect errors that affect instructions or data in the local memory as well as other parts of the network interface. The key to our technique is that the NCP is partitioned into various logical modules and only the functionalities of active logical modules are tested (a logical module is defined as the collection of all basic blocks that participate in providing a service, and an active logical module is the one providing a service to a running application). When compared