66 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 16, NO. 1, JANUARY 2008

A Case Study of Hardware/Software Partitioning of Traffic Simulation on the Cray XD1

Justin L. Tripp, Member, IEEE, Maya B. Gokhale, Fellow, IEEE, and Anders Å. Hansson, Member, IEEE

Abstract—Scientific application kernels mapped to reconfigurable hardware have been reported to have 10× to 100× speedup over equivalent software. These promising results suggest that reconfigurable logic might offer significant speedup on applications in science and engineering. To accurately assess the benefit of hardware acceleration on scientific applications, however, it is necessary to consider the entire application, including software components as well as the accelerated kernels. Aspects to be considered include alternative methods of hardware/software partitioning, communication costs, and opportunities for concurrent computation between software and hardware. Analysis of these factors is beyond the scope of current automatic parallelizing compilers. In this paper, a case study is presented in which a simulation of metropolitan road traffic networks is mapped onto a reconfigurable supercomputer, the Cray XD1. Five different methods are presented for mapping the application onto the combined hardware/software system. An approach for approximating the performance of each method is derived through analytic equations. Our analytical and empirical results show that performance is not necessarily predicted by maximum parallelism; the key predictors, which are often not considered in reported speedups of kernel operations, must account for the fraction of the problem that runs on the reconfigurable logic and the amount of data flow between software and hardware.

Index Terms—Hardware/software codesign, simulation, system integration.

Manuscript received May 22, 2006; revised July 16, 2007. This work was supported by Los Alamos National Laboratory, an affirmative action/equal opportunity employer, which is operated by Los Alamos National Security, LLC for the National Nuclear Security Administration of the U.S. Department of Energy under Contract DE-AC52-06NA25396.
J. L. Tripp and A. Å. Hansson are with Los Alamos National Laboratory, Los Alamos, NM 87545 USA.
M. B. Gokhale is with Lawrence Livermore National Laboratory, Livermore, CA 94550 USA.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2007.912126

I. INTRODUCTION

Reconfigurable coprocessing puts the extreme performance potential of programmable hardware to work on computationally intensive algorithms in science and engineering. Combining high-performance microprocessors, large field-programmable gate arrays (FPGAs), and low-latency, high-bandwidth interconnect, reconfigurable coprocessors have demonstrated 10–100× acceleration on compute-intensive scientific kernels, e.g., [1], [2]. Leading supercomputer vendors [3]–[5] offer machines that include programmable logic, and new software tools are appearing [6]–[8] that compile high-level languages to hardware.

Reconfigurable coprocessors offer the potential to improve computational performance by orders of magnitude. The use of hardware logic yields opportunities for improving coarse-grained (application-level), medium-grained (instruction-level), and fine-grained (operation-level) parallelism. First, FPGAs provide a large amount of highly parallel, configurable hardware resources, which makes it possible to create structures, such as parallel multiply/add instructions, that greatly accelerate individual operations. Innermost loops can be unrolled to expose additional instruction-level parallelism.
Loops can also be accelerated through careful scheduling of compute and memory access instructions and by loop pipelining. At a higher level, these parallel structures can themselves be replicated to create additional levels of parallelism, up to the limit of the target device capacity. Exposing and exploiting parallelism is critical to performance, as FPGA clock rates are a factor of 10 slower than those of high-performance microprocessors.

While the hardware offers many opportunities for parallelism, the size and scale of scientific and engineering applications present obstacles to reconfigurable supercomputing. Application programs are often several hundred thousand lines of code. The computationally intensive code segments must be located, and the code partitioned between software and hardware, with the kernel being rewritten either in a hardware description language (HDL) or a C dialect that can be compiled to hardware. Additionally, scientific applications are often dominated by 64-bit floating-point computation, which consumes too much area and memory bandwidth on present-day FPGAs to be competitive with dedicated 64-bit floating-point units on microprocessors. Computations with gigabyte memory arrays do not usually fit on current FPGA boards, thus requiring the communication of large blocks of data between software and hardware. Often, the kernels that are amenable to hardware acceleration cannot easily be overlapped with software computation, and overall speedup is limited according to Amdahl's Law [9].

It is usually the case that there are many different ways to partition the application, to implement the kernels in hardware, and to manage communication and synchronization. To exploit the full capacity of FPGA-based computing, it is essential to carefully select the portion(s) of the code to implement in hardware.
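
The limit that Amdahl's Law places on overall speedup can be illustrated with a short calculation. The following is a minimal sketch in Python, not code from this work: the function name `amdahl_speedup` and its parameter names are assumptions introduced here, and the model is the textbook form of the law, which ignores communication cost and any overlap between software and hardware.

```python
def amdahl_speedup(kernel_fraction: float, kernel_speedup: float) -> float:
    """Overall speedup when only part of the run time is accelerated.

    kernel_fraction: share of the original run time spent in the
        accelerated kernel, between 0 and 1 (hypothetical parameter name).
    kernel_speedup: factor by which the kernel alone is accelerated.
    """
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must lie in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    # The unaccelerated remainder keeps its original cost; only the
    # kernel's share of the run time shrinks.
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)


if __name__ == "__main__":
    # Illustrative values only; they are not measurements from the paper.
    for fraction in (0.50, 0.90, 0.99):
        for speedup in (10.0, 100.0):
            overall = amdahl_speedup(fraction, speedup)
            print(f"kernel fraction {fraction:.2f}, "
                  f"kernel speedup {speedup:.0f}x, overall {overall:.2f}x")
```

Under these assumptions, a kernel that accounts for half of the run time yields an overall speedup of just under 2× even when the kernel itself runs 100× faster, which is why the fraction of the problem placed in hardware matters as much as the kernel speedup.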
Ideally, time spent in the code kernel(s) should dominate overall run time, and the code kernel(s) should be able to exploit pipelining and replication, the sorts of spatial parallelism offered by FPGAs. The software portions of the algorithm may also need to be tuned to minimize communication time and to concurrently compute in both software and hardware.

In this paper, we make four significant contributions to the hardware/software partitioning problem. First, we present the mapping of a computationally intense application, metropolitan road traffic simulation, onto a reconfigurable computer, the Cray

U.S. Government work not protected by U.S. copyright.