Brief Announcement: How to Speed-up Fault-Tolerant Clock Generation in VLSI Systems-on-Chip via Pipelining

Andreas Dielacher, RUAG Aerospace Austria, andreas.dielacher@ruag.com
Matthias Függer, Ulrich Schmid, Technische Universität Wien, {fuegger,s}@ecs.tuwien.ac.at

Modern very-large-scale integration (VLSI) circuits, in particular systems-on-chip (SoCs), have much in common with the loosely-coupled distributed systems that have been studied by the fault-tolerant distributed algorithms community for decades [1, 4]. Recent work confirms that distributed algorithms results and methods are indeed applicable in VLSI design, and, conversely, that results and methods from VLSI design can also be applied successfully in the distributed algorithms context. Examples of the latter are error-correcting codes and pipelining, which is probably the most important paradigm for concurrency in VLSI design.

In general, pipelining is applicable to streamed data processing, where a sequence of individual data items is to be processed by a sequence of x > 1 actions a1, ..., ax applied to every data item. Consider a chain of processors p1, ..., px connected via storage elements (buffers), which provide the output of pi-1 as an input to pi. Instead of executing all actions on a single processor sequentially, every action ai is performed by processor pi here. Consequently, assuming a stream of data items to be processed, every single data item flows through the chain of processors similarly to how gas or water flows through a pipeline. Assuming that every action takes α seconds of processing time on its processor, the pipeline can process 1/α data items per second. This is a speed-up of a factor x over the at most 1/(xα) data items per second that could be digested by a single processor executing all x actions sequentially.

Pipelining is a well-known technique for speeding up synchronous distributed algorithms.
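The throughput arithmetic above can be sketched in a few lines of code; this is a minimal illustration with function names and numbers of our own choosing, not part of the paper:

```python
# Finish-time arithmetic for an x-stage pipeline vs. a single processor,
# assuming every action takes alpha seconds per data item.

def sequential_finish(n_items: int, x: int, alpha: float) -> float:
    """One processor runs all x actions on each item in turn."""
    return n_items * x * alpha

def pipelined_finish(n_items: int, x: int, alpha: float) -> float:
    """x processors, one action each: item i leaves the last stage at
    time (x + i) * alpha, so the last of n items is done at
    (x + n - 1) * alpha."""
    return (x + n_items - 1) * alpha

x, alpha, n = 4, 0.5, 1000
seq = sequential_finish(n, x, alpha)     # 2000.0 seconds
pipe = pipelined_finish(n, x, alpha)     # 501.5 seconds
# For a long stream, pipelined throughput approaches 1/alpha rather
# than 1/(x*alpha), i.e., a speed-up approaching x.
print(n / seq)    # 0.5 items/s = 1/(x*alpha)
print(n / pipe)   # ~2 items/s ~= 1/alpha
```

The fill latency of x·α seconds is amortized over the stream, which is why the speed-up only approaches x for long streams.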
In this paper,¹ we demonstrate that pipelining is also effective for speeding up asynchronous, fault-tolerant distributed algorithms in systems with large bandwidth × delay products. Basically, the idea is to just exploit the fact that any non-zero-delay FIFO end-to-end data transmission/processing path stores the whole information in transit, and hence has an inherently pipelined architecture. A fault-tolerant distributed algorithm may hence immediately start phase k (rather than wait for the acknowledgments of the previous data processing phase k - 1 in a "stop-and-go" fashion), provided that the acknowledgments for phase k - x (for some integer x > 1) have already been received from sufficiently many correct processes. If the system has at least x stages in the inherent pipeline of every end-to-end delay path, with stage delays δi and δete = δ1 + ... + δx, this allows speeding up the processing throughput from 1/δete up to 1/δs, where δs is the delay of the slowest stage in the pipeline.

We demonstrate the feasibility of this idea by providing a pipelined version of the DARTS fault-tolerant clock generation approach for SoCs introduced in [3]. Instead of using a quartz oscillator and a clock tree for disseminating the clock signal throughout the chip, DARTS clocks employ a Byzantine fault-tolerant distributed tick generation algorithm (TG-Alg). The algorithm is based on a variant of consistent broadcasting [5], which has been adapted to the particular needs of a VLSI implementation [3].

¹ This work has been supported by the FIT-IT project DARTS (809456) and the FWF projects P17757 and P20529.

Copyright is held by the author/owner(s). PODC'09, August 10–12, 2009, Calgary, Alberta, Canada. ACM 978-1-60558-396-9/09/08.
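The pipelined progress rule described above can be sketched as follows; the names, the quorum size, and the pipeline depth are our assumptions for illustration, not values from the paper:

```python
# Sketch: a process may enter phase k as soon as sufficiently many
# correct processes have acknowledged phase k - X, instead of waiting
# for acknowledgments of phase k - 1 ("stop-and-go").

X = 3            # assumed inherent pipeline depth
QUORUM = 5       # "sufficiently many" correct processes (assumed)

acks = {}        # phase -> acknowledgments received so far

def on_ack(phase):
    acks[phase] = acks.get(phase, 0) + 1

def may_start(phase):
    # The first X phases need no acknowledgments; phase k > X requires
    # a quorum of acks for phase k - X rather than for phase k - 1.
    return phase <= X or acks.get(phase - X, 0) >= QUORUM

# Example: with a quorum of acks for phase 1 collected, phase 4 may
# start even though phases 2 and 3 are still unacknowledged.
for _ in range(QUORUM):
    on_ack(1)
print(may_start(4))   # True
print(may_start(5))   # False (needs a quorum for phase 2)
```

This keeps up to X phases in flight through each end-to-end path, which is what lets the throughput track the slowest stage rather than the whole path delay.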
Unfortunately, since the frequency of an ensemble of DARTS clocks is solely determined by the end-to-end delays along certain paths (which depend on the physical dimensions of the chip and hence cannot be made arbitrarily small), the maximum clock frequency is limited. For example, our first FPGA prototype implementation ran at about 24 MHz; our recent space-hardened 180 nm CMOS DARTS ASIC runs at about 55 MHz. Fortunately, pipelining comes to the rescue for further speeding up the clock frequency here.

Like DARTS, the pipelined pDARTS derives from a simple synchronizer for the Θ-Model introduced in [5]. Its pseudo-code description is given below; X is a system parameter determined by the inherent pipeline depth of the system. The algorithm assumes a message-driven system (where processes, i.e., TG-Alg instances, make atomic receive-compute-send steps whenever they receive a message), where at most f of n = 3f + 1 processes may behave Byzantine. Correct processes are connected by a reliable point-to-point message-passing network (= TG-Net), with end-to-end delays within some (unknown) interval [τ-, τ+]. Let Θ = τ+/τ- denote the maximum delay ratio.

1: VAR k: integer := 0 /* Local clock value */
2: send tick(-X) ... tick(0) to all /* At booting time */
3: if received tick(ℓ) from f + 1 processes, with ℓ > k then
4:   send tick(k + 1) ... tick(ℓ) to all [once]
5:   k := ℓ
6: if received tick(ℓ) from 2f + 1 processes, with ℓ ≥ k - X then
7:   send tick(k + 1) to all [once]
8:   k := k + 1

Our detailed analysis revealed that correct processes generate a sequence of consecutive messages tick(k), k ≥ 1, in a synchronized way: If bp(t) denotes the value of the variable k of the TG-Alg at process p at real-time t, which gives the number of tick() messages broadcast so far, it turns out that (t2 - t1)αmin ≤ bp(t2) - bp(t1) ≤ (t2 - t1)αmax for any correct process p and t2 - t1 sufficiently large ("accuracy"); the constants αmin and αmax depend on τ-, τ+ and X. More-
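A minimal software sketch of the two tick-generation rules may help; pDARTS itself is an asynchronous hardware circuit, so the class, the callback, and the rule re-evaluation loop below are our own interpretation for illustration only:

```python
F = 1        # tolerated Byzantine processes, n = 3F + 1 (assumed)
X = 2        # pipeline-depth parameter (assumed)

class TGAlg:
    """One TG-Alg instance; `send` is a callback broadcasting tick(l)."""
    def __init__(self, send):
        self.k = 0                 # local clock value (line 1)
        self.sent = set()          # implements the [once] guard
        self.senders = {}          # tick value -> set of distinct senders
        self.send = send
        for l in range(-X, 1):     # line 2: send tick(-X) ... tick(0)
            self._send_once(l)

    def _send_once(self, l):
        if l not in self.sent:
            self.sent.add(l)
            self.send(l)

    def on_tick(self, l, sender):
        self.senders.setdefault(l, set()).add(sender)
        # Lines 3-5: f + 1 ticks with value l > k -> catch up to l.
        if l > self.k and len(self.senders[l]) >= F + 1:
            for m in range(self.k + 1, l + 1):
                self._send_once(m)
            self.k = l
        # Lines 6-8: 2f + 1 distinct processes with some tick >= k - X
        # -> increment k; we re-evaluate until the rule no longer fires,
        # mimicking the continuously evaluated hardware rules.
        while True:
            support = set()
            for val, who in self.senders.items():
                if val >= self.k - X:
                    support |= who
            if len(support) < 2 * F + 1:
                break
            self._send_once(self.k + 1)
            self.k += 1

# Feed tick(1) from three peers (n = 4, f = 1):
out = []
p = TGAlg(out.append)     # booting sends tick(-2), tick(-1), tick(0)
for peer in ("q1", "q2", "q3"):
    p.on_tick(1, peer)
# Rule 1 fires at the second tick(1); rule 2 then lets p run ahead
# until k - X exceeds the peers' tick value of 1.
print(out)   # [-2, -1, 0, 1, 2, 3, 4]
```

The X-slack in the second rule is what allows a process to run up to a bounded number of ticks ahead of its quorum, instead of lock-stepping one tick per round-trip.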