Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author. Copyright is held by the owner/author(s).
MEMSYS '15, October 05-08, 2015, Washington, DC, USA
ACM 978-1-4503-3604-8/15/10.
http://dx.doi.org/10.1145/2818950.2818956

A Data Centric Perspective on Memory Placement

Yitzhak Birk
Technion – Israel Inst. of Technology
Haifa 3200003, Israel
birk@ee.technion.ac.il

Oskar Mencer
Maxeler Technologies Ltd. and Imperial College London, UK
mencer@maxeler.com

ABSTRACT
In this paper, we focus on memory in its role as a channel for passing information from one instruction to another; in particular, in conjunction with spatial or dataflow computing architectures, wherein the computing elements are laid out like an assembly plant. We point out the opportunity to dramatically increase effective data-access bandwidth by moving from a centralized memory-array model with a few ports to numerous tiny buffers that can be accessed concurrently. The penalty is a loss of access flexibility, but this flexibility is often a by-product of the memory organization rather than a true need. Improvements in hardware reconfiguration speed and resolution, combined with the definition of standard buffer queuing and routing capabilities and with efforts by tool designers and application developers, are likely to extend the applicability of these architectures, offering dramatic power-cost-performance advantages.

Categories and Subject Descriptors
• Dataflow architectures • Reconfigurable Computing • Parallel architectures • Spatial Computing

Keywords
Spatial computing; dataflow engine; data centric; memory organization.

1. INTRODUCTION

1.1 Data Centric Thinking
Access to data and its movement within a system have become the performance and/or power bottleneck in an ever-increasing fraction of systems and application settings. Yet, system designs are still centered mostly on computation, with almost everything else viewed as a necessary evil. Even the term "dataflow machine" refers mostly to the removal of constraints on execution order other than data dependences, saying nothing about the amount or efficiency of data access and flow. Instead, we advocate a "data centric" approach, whereby data and access to it are placed at the center of architectural thinking, and other, more abundant resources are spent in order to improve data access. The data centric approach has three prongs: 1) pushing forward on memory and storage system designs, as well as the related communication; 2) rethinking algorithms and applications with data-access cost serving as a central measure (e.g., designing image-processing algorithms so as to reduce data movement and increase access locality, rather than focusing on computational complexity); and 3) rethinking architectures altogether in an attempt to gain performance leaps at the expense of giving up the convenience of the traditional Von Neumann architecture. The latter includes in-memory computing, whereby memory is still "the" memory. We focus on the third prong, but in a different direction.

1.2 Existing Architectures - a D-C Perspective
Memory organizations for stored-program (Von Neumann) machines abide by an "axiom" whereby any memory location should be accessible to any requesting entity. Hierarchical memory extends this, using caching to expedite access to a small fraction of the addresses at any given time; non-uniform memory architectures for multi-core processors extend it further by making certain addresses closer to certain cores. Yet, the basic rules are not broken.
Even accelerators such as GPUs are designed within this general framework. The benefit of this approach is flexibility and the ability to execute any program, albeit with varying levels of efficiency; the cost is inherently limited access concurrency. Also, a memory-array organization carries access-control overhead that is often acceptable only when amortized over a substantial memory size.

The basic Von Neumann stored-program architecture comprises fixed hardware that can be programmed to perform any desired function. ASICs (at least "pure" ones) comprise fixed hardware that can only carry out a specific function. Field-programmable gate arrays (FPGAs) form a third category: reconfigurable hardware. Their most common use, however, is motivated by techno-business considerations, such as the need for a particular set of on-chip elements (System on Chip) combined with a volume that does not justify an ASIC, or the desire to permit later changes. In these situations, FPGAs are used to implement specific architectures that themselves fall into one of the first two categories.
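The contrast between a central memory array with a few ports and many tiny per-edge buffers can be sketched in software. The following Python toy is our illustration, not from the paper, and all names in it are hypothetical: it models a spatial pipeline in which each pair of adjacent stages communicates through its own small bounded FIFO, so every edge of the dataflow graph can be read and written concurrently instead of all stages contending for one shared array.

```python
# A minimal sketch (illustrative only) of the buffer-per-edge idea:
# each producer/consumer pair in a spatial pipeline gets its own tiny
# bounded FIFO, so all edges operate concurrently, unlike a shared
# memory array with a few access ports.

import queue
import threading

DONE = object()  # sentinel marking the end of the data stream

def stage(fn, inq, outq):
    """One spatial compute unit: apply fn to items from inq, push to outq."""
    while True:
        item = inq.get()
        if item is DONE:
            outq.put(DONE)  # propagate end-of-stream downstream
            return
        outq.put(fn(item))

def pipeline(source, fns, depth=4):
    """Wire fns into a chain of stages joined by tiny bounded FIFOs."""
    # One small queue per edge of the dataflow graph, including input/output.
    edges = [queue.Queue(maxsize=depth) for _ in range(len(fns) + 1)]
    threads = [threading.Thread(target=stage, args=(fn, edges[i], edges[i + 1]))
               for i, fn in enumerate(fns)]
    for t in threads:
        t.start()
    for x in source:          # stream the input through the first edge
        edges[0].put(x)
    edges[0].put(DONE)
    results = []
    while True:               # drain the last edge
        item = edges[-1].get()
        if item is DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results

# e.g. a three-stage chain: scale, offset, square
print(pipeline(range(5), [lambda x: 2 * x, lambda x: x + 1, lambda x: x * x]))
# -> [1, 9, 25, 49, 81]
```

The depth-4 queues stand in for the tiny hardware buffers; in an FPGA dataflow engine the stages would be spatially laid-out compute units and the edges on-chip FIFOs, with no central memory on the datapath at all.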