738 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 17, NO. 9, SEPTEMBER 1998

An Efficient Algorithm for Performance-Optimal FPGA Technology Mapping with Retiming

Jason Cong and Chang Wu

Abstract— It is known that most field programmable gate array (FPGA) mapping algorithms consider only combinational circuits. Pan and Liu [22] recently proposed a novel algorithm, named SeqMapII, that performs technology mapping with retiming to minimize the clock period. Their algorithm, however, has run time and space requirements that are, in practice, too high for targeting K-lookup-table-based FPGA's implementing medium or large designs. In this paper, we present three strategies to improve the performance of the SeqMapII algorithm significantly. The run time and space of our algorithm are bounded in terms of the number of labeling iterations and the size of the partial flow network, and in practice both quantities are smaller than the number of gates in the circuit. Area minimization is also considered in our algorithm based on efficient low-cost K-cut computation.

Index Terms—Expanded circuit, field programmable gate array (FPGA), lookup table, retiming, technology mapping.

I. INTRODUCTION

THE technology mapping and synthesis problem for field programmable gate arrays (FPGA's) is to produce an equivalent circuit for a given circuit using only specific programmable logic blocks (PLB's). More specifically, without synthesis, the PLB's in a mapping solution form a cover of the gates in the original circuit, possibly with overlap. There are a variety of different PLB architectures. In this paper, we consider a generic type of PLB: the K-input lookup table (K-LUT), which has been widely used in current FPGA technology [1], [18], [30]. Most of the previous LUT mapping algorithms optimize either area (e.g., [13], [14], and [20]) or delay (e.g., [5], [15], and [21]). The algorithms in [4] and [7] consider both delay and area.
The algorithms in [25] and [27] consider routability. A comprehensive survey of FPGA mapping algorithms is given in [6]; however, most of these approaches apply only to combinational circuits. For sequential circuits, these approaches assume that the positions of flip-flops (FF's) are fixed so that the entire circuit can be partitioned into combinational subcircuits, each of which is mapped separately. A major limitation of these approaches is that they do not consider mapping and retiming simultaneously. In fact, the optimal mapping solutions for all combinational subcircuits may not lead to an optimal mapping solution for the entire sequential circuit due to the effect of retiming.

Manuscript received March 11, 1997. This work was supported in part by the National Science Foundation under Young Investigator Award MIP9357582 and by grants from Xilinx and Lucent Technologies under the California MICRO program. This paper was recommended by Associate Editor A. Saldanha.
The authors are with the Computer Science Department, University of California, Los Angeles, CA 90095 USA.
Publisher Item Identifier S 0278-0070(98)06759-1.

Retiming is a technique of moving FF's within the circuit without changing the circuit behavior. For circuits with a single-phase clock and edge-triggered FF's, Leiserson and Saxe [16], [17] solved the retiming problems of minimizing the clock period and of minimizing the number of FF's.

Several FPGA synthesis and mapping algorithms have been proposed specifically for sequential circuits. The approach in [19] does not consider retiming; rather, its objective is to pack LUT's with FF's properly so as to minimize the number of configurable logic blocks for Xilinx FPGA's [30]. The methods in [23] and [29] are heuristics that consider loopless sequential circuits. Touati et al. [28] proposed a retiming approach specifically for Xilinx FPGA's that is applied after mapping, placement, and routing. A significant advancement was made recently by Pan and Liu [22].
They proposed a novel algorithm, named SeqMapII, to find a mapping solution with the minimum clock period under retiming. Similar to the FlowMap algorithm [5], their algorithm works in two phases: the labeling phase and the mapping generation phase. They introduced the idea of expanded circuits to represent all possible K-LUT's under retiming and node replication. An iterative method is used to compute labels for all nodes. The time and space complexities of SeqMapII are polynomial in the number of gates in the circuit [22].¹ Although the SeqMapII algorithm runs in polynomial time, it has two shortcomings: 1) too many candidate values need to be considered for each label update and 2) the expanded circuits are too large for computing the optimal solutions. Experimental results show that the run time of SeqMapII for computing the optimal solutions is too long in practice (e.g., more than 12 h of CPU time for a design of 134 gates on a SPARC5 workstation).

In this paper, we present three strategies to improve significantly the performance of the label computation, which is the most time-consuming step in SeqMapII [22]. First, we prove that the monotone property of labels holds for sequential circuits and then develop an efficient label update that speeds up the algorithm. Second, we propose a new approach of K-cut computation on partial flow networks, which are much smaller than the expanded circuits used in SeqMapII, while guaranteeing the optimality of the results.

¹ The authors of [22] later reduced the time complexity of SeqMapII in [24] using the monotone property to be presented in Section IV of this paper (first presented in [10]).

0278–0070/98$10.00 © 1998 IEEE