2SWLPDO$XWRPDWLF+DUGZDUH6\QWKHVLV )RU6LJQDO3URFHVVLQJ$OJRULWKPV Nectarios Koziris, George Economakos, Theodore Andronikos, George Papakonstantinou and Panayotis Tsanakas National Technical University of Athens Dept. of Electrical and Computer Engineering Computer Science Division Zografou Campus, Zografou 15773, Greece e-mail:{nkoziris, papakon}@dsclab.ece.ntua.gr , ne of the most tedious tasks for a lot of sequential algorithms is the execution of nested FOR-loops with data dependencies among their computations. If a computation in one iteration, depends on a computation in another iteration, this dependence is presented as the vector difference of these two iteration indices. The majority of such algorithms present a regular vector pattern (uniform data dependencies). This means that the values of all dependence vectors are constants, i.e., they are independent of the indices of computations. A subclass of the class of uniform nested loops is the class of the unit dependence nested loops, where every dependence vector has zeroed or unit coordinates. Very important algorithms used in signal processing, such as matrix multiplication, LU decomposition, discrete Fourier transform, convolution and transitive closure fall into this category. In addition to this, even signal processing algorithms with non-uniform dependencies can be transformed into uniform ones [11]. Since dependence vectors describe computations’ flow, they are used to find the optimal parallel execution time. The widely used method is based on Lamport [7] who introduced the term “hyperplane”. The idea is to find a time schedule that partitions computations into different sets, which are called hyperplanes. All index points belonging to the same set can be executed concurrently. The major problem after having found a time schedule, is to organize computations in space, i.e., assign indexed computations to processors. 
Systolic arrays are widely used in signal processing because, due to their uniformity, they are suitable for massive parallelism and low-cost implementation (see Kung [6]). One of the most difficult issues when using a systolic array is the efficient use of its cells. The regularity of the systolic array structure makes it hard to organize computations efficiently and thus to increase cell utilization. Most existing methods for mapping loop algorithms onto systolic arrays achieve poor cell utilization and rely on exhaustive search-based mapping techniques [5], [6], [8], [9].

In this paper we apply the methodology presented in [2] to signal processing algorithms. In [4] we implemented an integrated design tool for the optimal mapping of nested loops with unit dependencies onto an unbounded number of systolic cells. We not only found an optimal time schedule for the loop iterations, but also assigned the concurrent iterations to the least possible number of cells. Here we show that most signal processing algorithms can be automatically synthesized in hardware using the previously established analysis. Our integrated tool accepts the nested loop specification as input and produces optimal systolic designs for the subclass of loops with unit uniform dependencies. The tool integrates the methods for optimal time and space scheduling onto an unbounded number of processors presented in [2], and produces VHDL descriptions of the resulting architecture. In particular, a VHDL preprocessor called GENVHDL has been implemented. It translates the optimal scheduling and mapping results into VHDL code, which can then be fed into VHDL entry CAD tools for synthesis and simulation (e.g., XILINX, WorkView Plus from VIEWlogic).