Co-Exploration of NLA Kernels and Specification of Compute Elements in Distributed Memory CGRAs

Mahesh Mahadurkar*, Farhad Merchant*, Arka Maity, Kapil Vatwani, Ishan Munje, Nandhini Gopalan*, S K Nandy*, and Ranjani Narayan*

*CADLab, Indian Institute of Science, Bangalore, India 560012
{maheshm@cadl, farhad@cadl, nandy@serc, nandhini@cadl}.iisc.ernet.in
Morphing Machines Pvt. Ltd.
ranjani.narayan@morphingmachines.com, arka.maity09@gmail.com
BITS Pilani
{kapilv14,munje.ishan}@gmail.com

Abstract—Coarse Grained Reconfigurable Architectures (CGRAs) are emerging as embedded application processing units in computing platforms for Exascale computing. Such CGRAs are distributed memory multi-core compute elements on a chip that communicate over a Network-on-Chip (NoC). Numerical Linear Algebra (NLA) kernels are key to several high performance computing applications. In this paper we propose a systematic methodology to obtain the specification of Compute Elements (CEs) for such CGRAs. We analyze block Matrix Multiplication and block LU Decomposition algorithms in the context of a CGRA, and obtain theoretical bounds on communication requirements and memory sizes for a CE. Support for high performance custom computations common to NLA kernels is met through Custom Function Units (CFUs) in the CEs. We present results to justify the merits of such CFUs.

Index Terms—CGRA, numerical linear algebra, computation, parallelism

I. INTRODUCTION

Compute intensive applications like guidance and control, SONAR beam-forming, and other numerically intensive applications like CFD, molecular dynamics, etc. [1][2][3][4] are traditionally realized on high performance computing platforms. Numerical Linear Algebra (NLA) kernels like Matrix Multiplication (MM), LU and Cholesky Factorization, and QR Decomposition (QRD) are key to these applications. NLA kernels comprise double precision floating point arithmetic.
In addition, achieving lower bounds for communication and storage has always been a challenge [5][6]. Further, scalability of such systems is limited by power constraints. Coarse Grained Reconfigurable Architectures (CGRAs) are emerging as embedded accelerated processing units for a new class of platforms for Exascale computing. CGRAs often provide the ability to customize certain attributes of functional units contained within Processing Elements (PEs) or Compute Elements (CEs). In this paper we restrict ourselves to CGRAs that are distributed memory multi-core compute elements on a chip. REDEFINE [7] is such a CGRA, in which multiple CEs are interconnected through a Network-on-Chip (NoC). The NoC enables REDEFINE to scale. CEs in REDEFINE have clean interfaces to hook in Custom Function Units (CFUs) [1][4] that perform certain domain specific computation kernels. A high throughput FFT realized on REDEFINE is reported in [8], in which radix-2 butterfly operations are performed using CFUs.

Tilera [9] is a mesh-connected interconnection of processing Tiles targeting high performance applications. Each Tile consists of a full-featured, 64-bit processor core, together with a flat, globally shared, cache coherent memory. Tilera can thus be viewed as a software customizable CGRA, or a software enabled embedded accelerated processing unit. Xentium, from Recore Systems [10], is a multi-core SoC architecture with 9 cores interconnected in a mesh topology; it can be easily integrated with other systems over a NoC or bus based SoC. Xentium is thus a software customizable CGRA, or a software enabled embedded accelerated processing unit, for DSP applications. DRRA [11] from KTH, Sweden, on the other hand, is a CGRA ideally suited for DSP applications; it achieves accelerated processing through custom function accelerators, defined in hardware, within DRRA cells. Convey [12] is a hybrid computing platform that employs an FPGA assist as the CGRA to realize domain specific application engines.
The FPGA in this case serves as a co-processor to a host, with a cache coherent connection to the processor. Clearly, the existing solutions offer a variety of merits and demerits in terms of performance, flexibility, and scalability. In the Exascale era, it is necessary that suitable specifications for a processing/compute element in a CGRA are drawn up in a systematic manner, so that issues related to application and architecture scalability, flexibility, power, and performance are addressed early in the design phase.

In this paper, we focus on deriving the specifications of Compute Elements for CGRAs that target NLA kernels. We derive the specifications of CEs for REDEFINE-like CGRAs through systematic algorithm analysis vis-à-vis the architecture of CEs. This is achieved as follows:

1) Enhance CEs with a programmable Floating Point Sequencer (FPS) capable of supporting multiple kernels on