Parallel Programming of High-Performance Reconfigurable Computing Systems with Unified Parallel C

Tarek El-Ghazawi, Olivier Serres, Samy Bahra, Miaoqing Huang and Esam El-Araby
Department of Electrical and Computer Engineering, The George Washington University
{tarek, serres, sbahra, mqhuang, esam}@gwu.edu

Abstract

High-Performance Reconfigurable Computers (HPRCs) integrate nodes of microprocessors and/or field-programmable gate arrays (FPGAs) through an interconnection network and system software into a parallel architecture. For domain scientists who lack hardware design experience, programming these machines is nearly impossible. Existing high-level programming tools, such as C-to-hardware compilers, address only designs on a single chip. Other tools require the programmer to create separate hardware and software program modules: an application programmer must explicitly develop the hardware and software sides of the application separately and determine how to integrate the two in order to achieve intra-node parallelism, and must then follow that with a further effort to exploit the extra-node parallelism. In this work, we propose unified parallel programming models for HPRCs based on the Unified Parallel C (UPC) programming language. Through extensions to UPC, the programmer is presented with a programming model that abstracts microprocessors and hardware accelerators through a two-level hierarchy of parallelism. The implementation is modular and capitalizes on the use of source-to-source UPC compilers: based on the parallel characteristics exhibited by the UPC program, code sections that are amenable to hardware implementation are extracted and diverted to a C-to-hardware compiler.
In addition to extending the UPC specifications to allow hierarchical parallelism and hardware/software co-processing, a framework is proposed for calling and using an optimized library of cores as an alternative programming model for additional enhancement. Our experimental results show that the proposed techniques are promising and can help non-hardware specialists program HPRCs with substantial ease, while achieving improved performance in many cases.

1. Introduction

High-Performance Reconfigurable Computers (HPRCs) are parallel architectures that integrate both microprocessors and field-programmable gate arrays (FPGAs) into scalable systems that can exploit the synergism between these two types of processors. HPRCs have been shown to achieve up to several orders of magnitude improvement in speed [1], as well as in size, power, and cost, over conventional supercomputers in application areas of critical national interest such as cryptography, bioinformatics, and image processing [2]. The productivity of HPRCs, however, remains an issue due to the lack of easy programming models for this class of architectures. Application development for such systems is viewed as a hardware design exercise that is not only complex but may also be prohibitive to domain scientists who are not computer or electrical engineers. Many High-Level Languages (HLLs) have been introduced to address this issue. However, these HLLs address only a single FPGA [4], leaving the developer to exploit, by brute force and with no tool support, the parallelism between the resulting hardware cores and the rest of the system's resources, and to scale the solution across multiple nodes. This has prompted the need for a programming model that addresses an entire HPRC with its different levels and granularities of parallelism.
In this work, we propose to extend the partitioned global address space (PGAS) scheme to provide a programming model that presents HPRC users with a global view capturing the overall parallel architecture of FPGA-based supercomputers. PGAS provides programmers with a global address space that is logically partitioned such that threads are aware of which data are local and which are not. PGAS programming languages are therefore characterized by their ease of use. Two different usage models are proposed and separately investigated, and then integrated to provide application developers with easy access to the performance of such
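As background, the PGAS model described above is what standard UPC already exposes: a shared array is distributed across threads with a defined affinity, and the upc_forall construct lets each thread operate on the elements it owns locally. The following minimal sketch uses only standard UPC 1.2 features (shared arrays, MYTHREAD, upc_forall, upc_barrier); it requires a UPC compiler such as Berkeley UPC and is shown for illustration only, not as part of the extensions proposed here:

```c
#include <upc.h>      /* UPC runtime: MYTHREAD, THREADS, upc_barrier */
#include <stdio.h>

#define N 1024

/* A shared array in the partitioned global address space.
   With the default (cyclic) layout, element a[i] has affinity to
   thread i % THREADS, so each thread "owns" part of the array. */
shared int a[N];

int main(void) {
    int i;

    /* The fourth clause of upc_forall is the affinity expression:
       iteration i executes on the thread to which &a[i] is local,
       so every write below targets local memory. */
    upc_forall (i = 0; i < N; i++; &a[i])
        a[i] = i;

    upc_barrier;   /* synchronize before reading possibly remote data */

    if (MYTHREAD == 0)
        printf("threads: %d, a[N-1] = %d\n", THREADS, a[N - 1]);
    return 0;
}
```

With Berkeley UPC, such a program would typically be compiled with a fixed thread count, e.g. `upcc -T 4 example.upc`; the same source then runs unchanged whether the threads map to one node or many, which is the ease-of-use property the PGAS model is credited with above.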