High speed self-timed pipelined datapath for square rooting

G. Cappuccino, G. Cocorullo, P. Corsonello and S. Perri

IEE Proc.-Circuits Devices Syst., Vol. 146, No. 1, February 1999
© IEE, 1999
IEE Proceedings online no. 19990271
DOI: 10.1049/ip-cds:19990271
Paper first received 24th March and revised 23rd September 1998
G. Cappuccino, G. Cocorullo and S. Perri are with the Department of Electronics, Computer Science and Systems, University of Calabria - Arcavacata di Rende, 87036 Rende (CS), Italy
G. Cocorullo is also with IRECE, National Council of Research - Via Diocleziano 328, 80125 Napoli, Italy
P. Corsonello is with the Department of Electronic Engineering and Applied Mathematics, University of Reggio Calabria - Loc. Feo di Vito, 89060 Reggio Calabria, Italy

Abstract: The authors describe a new high performance self-timed circuit for asynchronous square rooting. The new architecture is based on a modified nonrestoring algorithm. An asynchronous pipelined cellular array that needs no auxiliary system for the identification of exceptions is demonstrated. The self-timing approach greatly improves overall performance with respect to the synchronous implementation, at the cost of an acceptable area overhead.

1 Introduction

Square-root computations, for continuous operation, constitute a computational bottleneck in special-purpose hardware, such as real-time image processors and geometric-transform operators. In the last few years, many square root algorithms have been studied and fully parallel synchronous hardware implementations have been proposed.

As is well known, the synchronous design approach assumes that all circuit events are orchestrated by a central clock, and its period has to be larger than the worst-case delay of the slowest module. Very long clock connections have to be managed, with consequent power-consuming buffers and unavoidable clock skew. Moreover, standard synchronous circuits have to toggle clock lines in the portion of the circuit that is unused in the current computation, producing useless power dissipation. In these systems, detection of idle blocks and shutting down and restarting high-speed clocks imply unacceptable hardware and time overheads.

On the other hand, the asynchronous design approach is characterised by: local synchronisation; average-case performance of the combinational modules used in the circuit; and effective shutdown of the unused modules during computation. Obviously, these advantages are not free; they come at the expense of silicon area, which is taken up by the handshaking logic and end-completion-sensing modules. Moreover, asynchronous systems are more difficult to design than synchronous ones. In fact, the designer must pay a great deal of attention to the dynamic state of the circuit. Nevertheless, complex asynchronous systems, such as digital signal processors or microprocessors, have recently been demonstrated [1].

In many cases, the asynchronous design approach has been chosen because it reduces power dissipation and, consequently, reduces thermal problems. In many other applications, self timing can greatly improve performance without significantly decreasing power dissipation. For example, this happens in a pipelined datapath that often runs in the continuous-operation mode. In these cases, idle time never exists for the computational modules. However, self timing could speed up the circuits, especially when each of the several stages of the pipelined datapath computes its output in a different time.

In fact, let $N$ be the number of stages constituting a generic pipelined datapath. Let $\tau_1, \tau_2, \ldots, \tau_N$ and $\tau_{av1}, \tau_{av2}, \ldots, \tau_{avN}$ be the worst-case delay and the average-case delay of the several stages, respectively. Using the synchronous approach, the designer will obtain a circuit with a latency equal to $N \tau_{clk}$ and a throughput rate of $1/\tau_{clk}$, where $\tau_{clk} > \max(\tau_1, \tau_2, \ldots, \tau_N)$. The same circuit designed in asynchronous fashion will compute the outputs after an average latency equal to $\sum_{i=1}^{N} \tau_{avi}$ at an average throughput rate of $1/\tau_{av} = 1/\max(\tau_{av1}, \tau_{av2}, \ldots, \tau_{avN})$.

Given the circuit to be implemented, establishing whether the asynchronous approach is more convenient than the synchronous one is not a trivial problem. The designer must make sure that the advantage due to the average-speed computation of the modules is not cancelled out by the time overheads due to handshaking and completion-detection circuitry. In addition, the power consumption due to handshaking and completion-sensing modules must be taken into account.

Many efficient proposals of self-timed adders, multipliers and dividers are present in the literature. This paper deals with a new self-timed pipelined cellular array for square rooting. The circuit is based on a modified nonrestoring algorithm previously demonstrated. The asynchronous-design approach allows general performance to be improved with respect to the synchronous solution, with an acceptable area overhead.

2 Background to the algorithm and its synchronous implementation

Nonrestoring square root algorithms are based on a step-by-step result digit production by inspecting the