FPGA-Based Prototype of a PRAM-On-Chip Processor

Xingzhi Wen, Uzi Vishkin
University of Maryland Institute for Advanced Computer Studies (UMIACS)
Electrical and Computer Engineering, University of Maryland
College Park, Maryland, USA
hsmoon@umd.edu, vishkin@umd.edu

ABSTRACT
PRAM (Parallel Random Access Model) has been widely regarded as a desirable parallel machine model for many years, but it is also believed to be “impossible in reality.” As the new billion-transistor processor era begins, the eXplicit Multi-Threading (XMT) PRAM-On-Chip project is attempting to design an on-chip parallel processor that efficiently supports PRAM algorithms. This paper presents the first prototype of the XMT architecture, which incorporates 64 simple in-order processors operating at 75MHz. The micro-architecture of the prototype is described and its performance is studied on a set of micro-benchmarks. Using cycle-accurate emulation, the projected performance of an 800MHz XMT ASIC processor is compared with that of a 2.6GHz AMD Opteron, which occupies an area similar to that of a 64-processor ASIC version of the XMT prototype. The results suggest that an XMT ASIC system running at only 800MHz outperforms the 2.6GHz AMD Opteron, with speedups ranging between 1.57 and 8.56.

Categories and Subject Descriptors
C.1.4 [Parallel Architectures]

General Terms
Algorithms, Design, Performance

Keywords
Parallel Algorithms, PRAM, On-chip parallel processor, Ease-of-programming, Explicit multi-threading, XMT

¹ Partially supported by NSF grant CCF-0325393. The XMT home page is at: umiacs.umd.edu/users/vishkin/XMT

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CF’08, May 5–7, 2008, Ischia, Italy.
Copyright 2008 ACM 978-1-60558-077-7/08/05 ...$5.00.

1. INTRODUCTION
The eXplicit Multi-Threading¹ (XMT) on-chip general-purpose computer architecture is aimed at the classic goal of reducing single task completion time. XMT is a parallel algorithmic architecture in the sense that: (i) it seeks to provide good performance for parallel programs derived from Parallel Random Access Machine/Model (PRAM) algorithms, and (ii) it offers a methodology for advancing from PRAM algorithms to XMT programs, along with a performance metric and its empirical validation [27]. Ease of parallel programming is now widely recognized as the main stumbling block for extending commodity computer performance growth (e.g., using multi-cores). XMT provides a unique answer to this challenge.

A 64-processor, 75MHz computer based on field-programmable gate array (FPGA) technology was built at the University of Maryland (UMD). A brief announcement [29], which reported this first commitment of XMT to silicon, was a preamble to the current paper. The current paper adds six kernel benchmarks to the test suite and projects the performance of an 800MHz XMT ASIC version using cycle-accurate emulation. The XMT concept was introduced in [28]. An architecture simulator and speed-up results on several kernels were reported in [21]. The new computer is a significant milestone for the broad PRAM-On-Chip project at UMD.
In fact, the contributions of the current paper span several stages completed since SPAA’01 [21]: completion of the design in a hardware description language (HDL), synthesis into a gate-level netlist, and validation of the design in real hardware.

Discussion of the broader goals of XMT is deferred to the closing section of this article. These goals are to address the current need for a general-purpose on-chip parallel computer architecture that: (i) is easy to program; (ii) gives good performance with any amount of parallelism provided by the algorithm, namely, up- and down-scalability, including backwards compatibility on serial code; (iii) supports application programming (in standard application languages, such as VHDL/Verilog, OpenGL, MATLAB); and (iv) fits current chip technology and scales with it.

PRAM
The PRAM virtual model of computation is a generalization of the Random Access Machine (RAM) model. The RAM is the basic sequential model exposed to programmers in traditional programming languages; it assumes that any memory access or any (logic or arithmetic) operation takes unit time. The PRAM assumes that any number of concurrent accesses to a shared memory take the same time as a single access. In the Arbitrary Concurrent-Read Concurrent-Write (CRCW) PRAM, concurrent accesses to the same memory location, for reads or for writes, are allowed. Reads are resolved