Scalable Memory Hierarchies for Embedded Manycore Systems

Sen Ma, Miaoqing Huang, Eugene Cartwright, and David Andrews
Department of Computer Science and Computer Engineering
University of Arkansas
{senma,mqhuang,eugene,dandrews}@uark.edu

Abstract. As the size of FPGA devices grows following Moore's law, it becomes possible to put a complete manycore system onto a single FPGA chip. In the centralized memory hierarchy of typical embedded systems, both data and instructions are stored in off-chip global memory, which leads to bus contention as the number of processing cores increases. In this work, we explore how distributed multi-tiered memory hierarchies can affect the scalability of manycore systems. We use Xilinx Virtex FPGA devices as the testing platforms and buses as the interconnect. Several variants of the centralized memory hierarchy and the distributed memory hierarchy are compared by running various benchmarks, including matrix multiplication, IDEA encryption, and 3D FFT. The results demonstrate the good scalability of the distributed memory hierarchy for systems of up to 32 MicroBlaze processors, a limit imposed by the FPGA resources available on the Virtex-6 LX240T device.

Keywords: Distributed memory hierarchy, manycore architecture, embedded system.

1 Introduction

Current FPGA densities have reached the million-LUT level, allowing a complete multiprocessor system on programmable chip (MPSoPC) to be configured within a single device. While FPGA density still lags that of CMOS ASICs, the malleability of the FPGA fabric gives system designers the flexibility to mix and match different types of processors and computational components, tailored to the requirements of each individual application.
The use of FPGAs as multiprocessor systems on programmable chips, rather than as point-design custom accelerators, has been further enabled by the availability of the necessary soft IP system components, such as standard buses, soft processors with caches, and multi-port memory controllers. As an example, Xilinx's MicroBlaze soft processor supports several standard bus interconnects, such as the Processor Local Bus (PLB), the XCL bus, and the Local Memory Bus (LMB), as well as a Multi-Port Memory Controller (MPMC). The MPMC enables the creation of a Symmetric Multiprocessor (SMP) shared memory architecture for up to 7 processors plus
O.C.S. Choy et al. (Eds.): ARC 2012, LNCS 7199, pp. 151–162, 2012. © Springer-Verlag Berlin Heidelberg 2012