Online Clock Routing in Xilinx FPGAs for High-Performance and Reliability

Xabier Iturbe *†, Khaled Benkrid *, Raul Torrego †, Ali Ebrahim * and Tughrul Arslan *

* System Level Integration Group, The University of Edinburgh, Edinburgh EH9 3JL, Scotland, UK
{x.iturbe, k.benkrid, a.ebrahim, t.arslan}@ed.ac.uk

† Embedded System-on-Chip Group, IKERLAN-IK4 Research Alliance, Mondragón 20500, Basque Country (Spain)
{xiturbe, rtorrego}@ikerlan.es

Abstract—In this paper, we report the design and implementation of a reconfigurable system that exploits the regional clocking resources available in Xilinx Virtex-4 FPGAs for increased performance and, for the first time, enhanced reliability. Unlike previous approaches, our system is able to manage the regional clock buffers (BUFRs) individually, both to adjust the frequency delivered to each hardware task and to detect and recover on-the-fly from faults affecting the clock-tree. To this end, we propose global and regional clock multiplexers, named GCMUX and RCMUX respectively, which allow switching to spare clocking resources whenever needed. These multiplexers are built from the internal programmable interconnection points of the FPGA, and therefore incur zero area overhead.

I. INTRODUCTION

State-of-the-art trends in Reconfigurable Computing (RC) envision substantial gains in performance, reliability and power efficiency over traditional systems by customizing the underlying architecture of the system at runtime to match the specific needs of a given application [1]. Different pieces of circuitry, each specifically designed to implement one type of computation efficiently, are allocated on a dynamically reconfigurable FPGA, executed and finally replaced by other circuits, leading to a continuous stream of input operands, computation and output results. By analogy with the software field, these swappable pieces of circuitry are called hardware tasks.
Current research efforts try to improve the performance of RC in each of the domains where computation occurs: space and time [2]. In the space domain, the aim is to increase the allocatability of the hardware tasks by reducing fragmentation in the chip. In the time domain, the efforts concentrate on exploiting the parallelism delivered by the FPGA. This includes both process-level parallelism, or multitasking, where the objective is to run the largest possible number of hardware tasks simultaneously, and data-level parallelism, where the objective is to build efficient architectures able to exploit the high bandwidth offered by the tens of thousands of logic blocks and memories included in modern FPGAs.

However, these approaches can benefit greatly from advances in precisely the key factor for computing speed in traditional processors: the clock frequency. Indeed, the highest clock frequency at which a hardware task can run is determined by the maximum delay between its sequential components, the so-called longest path, which usually depends on the complexity of the task itself. As a result, an RC application typically contains several hardware tasks that can run at different clock frequencies. Clocking the whole system at the slowest rate is the easy option and is often chosen, but it is not the most efficient one, since every faster task is then held back to the speed of the slowest. Hence, the open question is how to deal with frequency heterogeneity to improve the performance of an RC application.

Another issue of concern in RC is fault tolerance. Researchers have traditionally focused on scrubbing bit upsets [3] and reconfiguring around damaged resources [4], but little attention has been paid to common-source failures such as those in the clock distribution. However, the clock-tree is a single point of failure and must be carefully hardened to increase the reliability of the system [5].

Modern families of Xilinx FPGAs include enhanced clocking capabilities that can be useful when dealing with the above-mentioned problems.
These devices allow different portions of the device's reconfigurable area, the so-called clock regions [7], to be handled independently. Each clock region includes its own clocking resources, and thus the clock-tree is not a single resource that must be managed as a whole. Instead, it is divided into several branches that feed the hardware tasks. However, most RC systems still do not exploit these capabilities offered by new FPGAs.

This paper aims to exploit the multi-branched clock-tree to increase performance and reliability in an RC system. The work reported here is part of a larger effort in our group to implement OS-like support for developing reconfigurable applications on Xilinx FPGAs, e.g. support for task scheduling, allocation, deallocation, inter-task communication and synchronization. Our OS is named Reliable Reconfigurable Real-Time Operating System (R3TOS) [6], emphasizing its three major features: reconfigurability, reliability and real-time performance.

The main contributions of this paper are twofold:
• The implementation of an RC system able to manage the regional clocking resources at runtime so that each hardware task runs at its maximum frequency.
• A novel method to detect and recover at runtime from a fault affecting the clock-tree.
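
To make the frequency-heterogeneity argument concrete, the following minimal Python sketch compares a single system-wide clock with one clock per hardware task. It is only an illustration under stated assumptions: the task names and longest-path delays are hypothetical, and the model simply takes the maximum frequency of a task to be the reciprocal of its longest-path delay, ignoring setup time, clock skew and the discrete set of frequencies that real clock buffers can deliver.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardwareTask:
    name: str
    longest_path_ns: float  # hypothetical longest-path delay, in nanoseconds


def max_frequency_mhz(task: HardwareTask) -> float:
    """Idealized maximum clock frequency: the reciprocal of the longest path."""
    return 1000.0 / task.longest_path_ns


def single_clock_mhz(tasks: list[HardwareTask]) -> float:
    """One clock for the whole system must satisfy the slowest task."""
    return min(max_frequency_mhz(t) for t in tasks)


def per_task_clocks_mhz(tasks: list[HardwareTask]) -> dict[str, float]:
    """With one clock branch per task, each task can run at its own maximum."""
    return {t.name: max_frequency_mhz(t) for t in tasks}


def speedups(tasks: list[HardwareTask]) -> dict[str, float]:
    """Per-task speedup of individual clocking over the single slow clock."""
    base = single_clock_mhz(tasks)
    return {name: f / base for name, f in per_task_clocks_mhz(tasks).items()}


# Hypothetical task set, for illustration only.
tasks = [
    HardwareTask("task_a", 4.0),
    HardwareTask("task_b", 5.0),
    HardwareTask("task_c", 10.0),
]

global_clock = single_clock_mhz(tasks)
task_clocks = per_task_clocks_mhz(tasks)
gain = speedups(tasks)
```

In this made-up example the slowest task limits a single system clock to 100 MHz, whereas individual clocking lets the other two tasks run at 250 MHz and 200 MHz. The slowest task gains nothing, which is why the benefit depends on how heterogeneous the task set is.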