High-Level Synthesis of Multiple-Precision Circuits Independent of Data-Objects Length*

M.C. Molina, J.M. Mendías, R. Hermida
Dpto. Arquitectura de Computadores y Automática
Universidad Complutense de Madrid
Avda. Complutense s/n, 28040 Madrid (SPAIN)
{cmolinap, mendias, rhermida}@dacya.ucm.es

ABSTRACT
This paper presents a heuristic method to perform the high-level synthesis of multiple-precision specifications. The scheduling is based on balancing the number of bits calculated per cycle, and the allocation on the bit-level reuse of the hardware resources. The implementations obtained are multiple-precision datapaths independent of the number and widths of the specification operations. As a result, considerable area savings are achieved compared with the implementations produced by conventional algorithms.

Categories and Subject Descriptors
B.5.2 [Register-transfer-level Implementation]: Design Aids – automatic synthesis, optimization.

General Terms
Algorithms, Design.

Keywords
High-level synthesis, multiple-precision, scheduling, allocation.

1. INTRODUCTION
Conventional high-level synthesis (HLS) scheduling algorithms set the cycle in which every specification operation starts its execution. They try to balance the number of operations of each type scheduled per cycle, regardless of their widths. Figure 1a) shows a possible scheduling provided by a conventional algorithm for a multiple-precision specification formed by three additions with no data dependencies. However, it is possible to obtain circuits with smaller area by distributing the number of bits calculated uniformly among the cycles. In this case, an operation may be executed during a set of not necessarily consecutive cycles, starting from the least significant bits. The carry-out produced in one cycle must be stored so that it can be supplied as the carry-in during the next execution cycle of the operation.
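The multi-cycle execution just described can be illustrated with a minimal software sketch. It is not the authors' algorithm; it only models one addition split into fragments that are computed least significant bits first, with the carry-out of each fragment stored and reused as the carry-in of the next one. The function name `add_in_fragments`, its parameters, and the operand values are hypothetical and chosen only for illustration.

```python
def add_in_fragments(b, c, width, fragment_widths):
    """Add two unsigned `width`-bit operands one fragment per execution cycle.

    `fragment_widths` lists how many bits are computed in each execution
    cycle, starting from the least significant bits (illustrative model only).
    Returns the `width`-bit sum and the final carry-out.
    """
    if sum(fragment_widths) != width:
        raise ValueError("fragment widths must add up to the operand width")

    result = 0
    carry = 0   # stored carry-out, supplied as carry-in to the next fragment
    shift = 0   # position of the current fragment within the operands

    for w in fragment_widths:
        mask = (1 << w) - 1
        b_frag = (b >> shift) & mask
        c_frag = (c >> shift) & mask
        partial = b_frag + c_frag + carry    # w-bit adder with carry-in
        result |= (partial & mask) << shift  # keep the w sum bits
        carry = partial >> w                 # store the carry-out
        shift += w

    return result, carry


# A 16-bit addition computed as 6 bits in one cycle and 10 bits in a later one.
total, carry_out = add_in_fragments(0xABCD, 0x1234, 16, [6, 10])
```

Under these assumptions the fragmented addition yields the same sum and carry-out as a single 16-bit addition, whatever the split, which is what allows a scheduler to place the fragments in different cycles.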
Figure 1b) shows another possible scheduling for the specification, in which the least significant bits of the operation (A=B+C) are calculated in cycle i, and the remaining ones in cycle i+1.

Conventional HLS allocation algorithms provide datapaths where every operation is executed over one functional unit (FU) of the same or greater width. The number of FUs is determined by the maximum number of operations scheduled in the same cycle, and their widths by the number of bits of the widest operations. When these widest operations are not scheduled in that busiest cycle, a considerable waste of area may result. Figure 1c) shows the conventional solution for the scheduling in Figure 1a). However, it is possible to obtain circuits with smaller area by following other design strategies, such as the simultaneous execution of several operations over one FU, or the execution of one operation over several linked FUs. Figures 1d) and 1e) show these novel solutions for the scheduling in Figure 1b).

The argument made for operations and FUs extends to variables and storage resources, and to data transfers and routing resources.

In this paper we propose an algorithm able to offer solutions like the ones presented above. These scheduling and allocation methodologies lead to smaller datapaths where the number and widths of the HW resources are independent of the number and widths of the specification operations and variables.

2. RELATED WORK
Multiple-precision behaviours appear both in SW and HW, especially in the development of DSP applications. The problem to be solved in DSP SW development is the transformation of the original multiple-precision specification into another one with a single width. This requires truncation and extension operators to make the operation widths uniform, with the consequent loss of precision in the first case and waste of HW in the second.
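The trade-off between truncation and extension can be made concrete with a small sketch. It assumes a hypothetical specification with operation widths of 6, 8 and 16 bits, which are not taken from the paper, and simply counts how many bits are discarded and how many are padded when every operation is forced to one target width. The helper name `uniform_width_cost` is likewise an assumption.

```python
def uniform_width_cost(op_widths, target_width):
    """Count the bits affected by forcing every operation to `target_width`.

    Returns (bits_truncated, bits_extended): the bits discarded from
    operations wider than the target (loss of precision) and the bits
    padded onto operations narrower than the target (unused hardware).
    """
    bits_truncated = sum(max(w - target_width, 0) for w in op_widths)
    bits_extended = sum(max(target_width - w, 0) for w in op_widths)
    return bits_truncated, bits_extended


# Hypothetical operation widths, chosen only for illustration.
widths = [6, 8, 16]
narrow = uniform_width_cost(widths, 8)   # truncate down to 8 bits
wide = uniform_width_cost(widths, 16)    # extend up to 16 bits
```

With these widths, choosing 8 bits discards 8 bits of the widest operation, while choosing 16 bits keeps every bit but pads the two narrower operations with 18 unused bits in total.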
Recent research in multiple-precision SW derives fixed-point implementations from floating-point or infinite-precision descriptions [1] [8].

Within the HW field there has been little research in HLS for multiple-precision specifications. Up to now most systems have adopted trivial solutions, balancing the number of operations executed per cycle, and binding operations to FUs with identical widths [4] [5]. More efficient systems execute operations over wider FUs, extending the operands when necessary, and discarding some bits of the results produced [2] [3]. Nevertheless, some authors admit that these trivial solutions are not good enough and suggest (but do not implement) other alternatives, such as the execution of an operation during several cycles [3].

* Supported by Spanish Government Grant CICYT TIC-99 0474

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
DAC 2002, June 10-14, 2002, New Orleans, Louisiana, USA.
Copyright 2002 ACM 1-58113-461-4/02/0006…$5.00.