A Communication Scheduling Algorithm For Multi-FPGA Systems *

Jinwoo Suh, Dong-In Kang, and Stephen P. Crago
University of Southern California Information Sciences Institute
4350 N. Fairfax Drive, Suite 770, Arlington, VA 22203
{jsuh, dkang, crago}@isi.edu

* Effort sponsored by Defense Advanced Research Projects Agency (DARPA) through the Air Force Research Laboratory, USAF, under agreement numbers F30602-99-1-0521, F30602-97-1-0222, and F33615-98-C-1320. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsement, either expressed or implied, of the Defense Advanced Research Projects Agency (DARPA), Air Force Research Laboratory, or the U.S. Government.

Abstract

In multi-FPGA systems, the limited number of I/O pins causes many problems. Efficient communication scheduling among FPGAs is therefore crucial for obtaining high CLB utilization. In this paper, we provide a heuristic for this NP-complete scheduling problem. The experimental results show that our algorithm generates excellent communication schedules: more than 90% of the randomly generated problem instances were scheduled with less than 20% overhead compared with an optimal algorithm. The execution time of our scheduling algorithm is two orders of magnitude less than that of the optimal scheduling algorithm.

1. Introduction

In this paper, we discuss communication scheduling on multiple FPGA systems. The number of CLBs is proportional to the chip area (O(l^2)), while the number of I/O pads is proportional to the chip perimeter (O(l)), where l is the length of a chip. As a result, the I/O bandwidth available per CLB shrinks as the number of CLBs continues to grow. Underutilization of I/O bandwidth exacerbates the problem.
To support a given I/O rate, pin bandwidth underutilization increases the number of FPGAs needed. This increase in the number of FPGAs raises the parts cost and lengthens development time [1]. To address these problems, we propose an efficient communication scheduling algorithm that uses the available I/O pins efficiently to increase CLB utilization.

2. Problem Definition

An application is partitioned into N tasks. Each task, τ_v, 0 ≤ v ≤ N-1, has a computation time, T(τ_v). Each task is mapped to an FPGA, F_i, 0 ≤ i ≤ F-1, where F is the number of FPGAs. If a task, τ_v, on F_j needs data from a task, τ_u, 0 ≤ u ≤ N-1, on F_i (i ≠ j), then communication must be performed between F_i and F_j before τ_v can be performed. The communication takes T(e_{u,v}), where e_{u,v} is the required communication between τ_u and τ_v.

The communication is performed through the I/O pins of the FPGAs. We call a set of I/O pins that is used for the communication a channel, C_x, 0 ≤ x ≤ C-1, where C is the number of channels. A channel cannot be used by two communications simultaneously and is non-preemptive. For simplicity, we assume that the channels are unidirectional. The objective is to find a communication schedule that has minimum latency for the given tasks and channels.

Theorem 2-1: The scheduling of communication for a multiple FPGA system is NP-complete.

The theorem can be proved by a reduction from flowshop scheduling [2]. The proof is omitted due to space limitations.

3. Our Algorithm

Our algorithm uses heuristic weight values, w(τ_i) for a task τ_i and w(e_{i,j}) for an edge e_{i,j}, to determine the priority of each communication edge in scheduling. When multiple communications compete for a communication channel at a given time, the one having the largest weight value is chosen for scheduling. The algorithm is shown in Figure 1.
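The arbitration rule just stated can be illustrated with a minimal Python sketch. It is not the authors' Largest_Weight_First procedure: it is a simple forward list scheduler that applies only the "largest weight wins the channel" rule. The names `Edge` and `largest_weight_first` are hypothetical, the weights are supplied by the caller rather than computed by Calculate_Weight, and the sketch assumes that every edge is an inter-FPGA communication, that all times are positive, and that a task starts as soon as all of its input data has arrived.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    src: int        # u: index of the producing task
    dst: int        # v: index of the consuming task
    channel: int    # x: index of the channel C_x carrying e_{u,v}
    duration: int   # T(e_{u,v})
    weight: float   # w(e_{u,v}); assumed to be given by the caller


def largest_weight_first(task_time, edges):
    """Return ({(src, dst): start_time}, latency) for the given instance."""
    n = len(task_time)
    pending_in = [0] * n          # inputs each task is still waiting for
    for e in edges:
        pending_in[e.dst] += 1
    arrival = [0] * n             # time the latest input of each task arrives
    # Tasks without inputs start at time 0 (assumption of this sketch).
    finish = {v: task_time[v] for v in range(n) if pending_in[v] == 0}
    channel_free = {}             # channel index -> time it becomes idle
    start = {}
    remaining = list(edges)

    while remaining:
        best_key, best_edge = None, None
        for e in remaining:
            if e.src not in finish:
                continue          # producer has not been scheduled yet
            t = max(finish[e.src], channel_free.get(e.channel, 0))
            key = (t, -e.weight)  # earliest start first, then largest weight
            if best_key is None or key < best_key:
                best_key, best_edge = key, e
        if best_edge is None:
            raise ValueError("cyclic dependency among tasks")

        t, e = best_key[0], best_edge
        start[(e.src, e.dst)] = t
        channel_free[e.channel] = t + e.duration   # non-preemptive use
        arrival[e.dst] = max(arrival[e.dst], t + e.duration)
        pending_in[e.dst] -= 1
        if pending_in[e.dst] == 0:
            finish[e.dst] = arrival[e.dst] + task_time[e.dst]
        remaining.remove(e)

    return start, max(finish.values())


# Two producers finish at time 2 and compete for channel 0.
task_time = [2, 2, 1, 4]
edges = [Edge(0, 2, 0, 3, weight=1.0), Edge(1, 3, 0, 3, weight=5.0)]
schedule, latency = largest_weight_first(task_time, edges)
```

In this instance the heavier edge (1 to 3) takes the channel first, so the long consumer task τ_3 starts earlier and the latency is 9; serving the lighter edge first would give a latency of 12.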
The algorithm consists of two steps: (i) evaluation of weights of tasks in breadth-first search fashion (Calculate_Weight) and (ii) scheduling of communication edges in each channel in bottom-up fashion (Largest_Weight_First).

Algorithm Calculate_Weight(τ_i)
(1) If visited(τ_i) = True
(2)     Return;