66 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 16, NO. 1, JANUARY 2008

A Case Study of Hardware/Software Partitioning of Traffic Simulation on the Cray XD1

Justin L. Tripp, Member, IEEE, Maya B. Gokhale, Fellow, IEEE, and Anders Å. Hansson, Member, IEEE

Abstract—Scientific application kernels mapped to reconfigurable hardware have been reported to achieve 10× to 100× speedup over equivalent software. These promising results suggest that reconfigurable logic might offer significant speedup on applications in science and engineering. To accurately assess the benefit of hardware acceleration on scientific applications, however, it is necessary to consider the entire application, including software components as well as the accelerated kernels. Aspects to be considered include alternative methods of hardware/software partitioning, communication costs, and opportunities for concurrent computation between software and hardware. Analysis of these factors is beyond the scope of current automatic parallelizing compilers. In this paper, a case study is presented in which a simulation of metropolitan road traffic networks is mapped onto a reconfigurable supercomputer, the Cray XD1. Five different methods are presented for mapping the application onto the combined hardware/software system. An approach for approximating the performance of each method is derived through analytic equations. Our results, both analytical and empirical, show that the key predictors of performance (which are often not considered in reported speedups of kernel operations) are not necessarily maximum parallelism; they must also account for the fraction of the problem that runs on the reconfigurable logic and the amount of data flow between software and hardware.

Index Terms—Hardware/software codesign, simulation, system integration.

Manuscript received May 22, 2006; revised July 16, 2007. This work was supported by Los Alamos National Laboratory, which is an affirmative action/equal opportunity employer operated by Los Alamos National Security, LLC for the National Nuclear Security Administration of the U.S. Department of Energy under Contract DE-AC52-06NA25396.
J. L. Tripp and A. Å. Hansson are with Los Alamos National Laboratory, Los Alamos, NM 87545 USA.
M. B. Gokhale is with Lawrence Livermore National Laboratory, Livermore, CA 94550 USA.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2007.912126

I. INTRODUCTION

RECONFIGURABLE coprocessing puts the extreme performance potential of programmable hardware to work on computationally intensive algorithms in science and engineering. Combining high performance microprocessors, large field-programmable gate arrays (FPGAs), and low latency, high bandwidth interconnect, reconfigurable coprocessors have demonstrated 10–100× acceleration on compute-intensive scientific kernels, e.g., [1], [2]. Leading supercomputer vendors [3]–[5] offer machines that include programmable logic, and new software tools are appearing [6]–[8] that compile high level languages to hardware.

Reconfigurable coprocessors offer the potential to improve computational performance by orders of magnitude. The use of hardware logic yields opportunities for improving coarse-grained (application-level), medium-grained (instruction-level), and fine-grained (operation-level) parallelism. First, FPGAs provide a large amount of highly parallel, configurable hardware resources, which makes it possible to create structures, such as parallel multiply/add instructions, that greatly accelerate individual operations. Innermost loops can be unrolled to expose additional instruction level parallelism.
Loops can also be accelerated through careful scheduling of compute and memory access instructions and by loop pipelining. At a higher level, these parallel structures can themselves be replicated to create additional levels of parallelism, up to the limit of the target device capacity. Exposing and exploiting parallelism is critical to performance because FPGA clock rates are a factor of 10 slower than those of high performance microprocessors.

While the hardware offers many opportunities for parallelism, the size and scale of scientific and engineering applications present obstacles to reconfigurable supercomputing. Application programs are often several hundred thousand lines of code. The computationally intensive code segments must be located, and the code partitioned between software and hardware, with the kernel being rewritten either in a hardware description language (HDL) or in a C dialect that can be compiled to hardware.

Additionally, scientific applications are often dominated by 64-bit floating point computation, which consumes too much area and memory bandwidth on present-day FPGAs to be competitive with dedicated 64-bit floating point units on microprocessors. Computations on gigabyte memory arrays usually do not fit on current FPGA boards, and thus require the communication of large blocks of data between software and hardware. Often, the kernels that are amenable to hardware acceleration cannot easily be overlapped with software computation, and overall speedup is limited according to Amdahl's Law [9].

It is usually the case that there are many different ways to partition the application, to implement the kernels in hardware, and to manage communication and synchronization. To exploit the full capacity of FPGA-based computing, it is essential to carefully select the portion(s) of the code to implement in hardware.
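
The Amdahl's Law limit mentioned above can be illustrated with a minimal sketch. This is not the analytic performance model derived in this paper; it is a generic textbook-style estimate. The function name `overall_speedup` and its parameters are assumptions introduced only for illustration, and all times are normalized so that the original software-only run takes one unit of time. The `comm_fraction` term is a simple, assumed way of charging hardware/software data transfer against the accelerated run.

```python
def overall_speedup(accel_fraction, kernel_speedup, comm_fraction=0.0):
    """Amdahl-style estimate of whole-application speedup (illustrative only).

    accel_fraction: share of the original run time spent in the kernel that
                    is moved to hardware, between 0 and 1.
    kernel_speedup: speedup of that kernel on the hardware, greater than 0.
    comm_fraction:  extra software/hardware transfer time, expressed as a
                    share of the original run time (an assumed, simplified
                    cost model; no overlap with computation is modeled).
    """
    if not 0.0 <= accel_fraction <= 1.0:
        raise ValueError("accel_fraction must be between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    if comm_fraction < 0.0:
        raise ValueError("comm_fraction must be non-negative")

    # Normalized run time after acceleration: the untouched software part,
    # plus the accelerated kernel, plus the added communication cost.
    new_time = (1.0 - accel_fraction) + accel_fraction / kernel_speedup + comm_fraction
    return 1.0 / new_time
```

Under these assumptions, a kernel that accounts for 90% of run time and is accelerated 100× yields an estimated overall speedup of about 9.2×, and adding communication equal to 10% of the original run time reduces the estimate to about 4.8×. This is why the fraction of work moved to hardware and the data transferred matter as much as raw kernel speedup.
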
Ideally, time spent in the code kernel(s) should dominate overall run time, and the code kernel(s) should be able to exploit pipelining and replication, the sorts of spatial parallelism offered by FPGAs. The software portions of the algorithm may also need to be tuned to minimize communication time and to compute concurrently in both software and hardware.

U.S. Government work not protected by U.S. copyright.

In this paper, we make four significant contributions to the hardware/software partitioning problem. First, we present the mapping of a computationally intensive application, metropolitan road traffic simulation, onto a reconfigurable computer, the Cray