STREAMING SCRATCHPAD MEMORY ORGANIZATION FOR VIDEO APPLICATIONS

Aleksandar Berić*, Ramanathan Sethuraman†, Harm Peters†, Gerard Veldman†, Jef van Meerbergen*,†, Gerard de Haan*,†

* Eindhoven University of Technology, Dept. of Electrical Eng., Den Dolech 2, 5600MB Eindhoven, The Netherlands
† Philips Research Laboratories, Prof. Holstlaan 5656AA Eindhoven, The Netherlands
email: a.b.beric@tue.nl

ABSTRACT

To address high data bus bandwidth requirements, the vast majority of video processing algorithms exploit the principle of locality of reference. In particular, for application kernels based on motion estimation, fetching the pixel data from local storage is unavoidable. However, the requirements of video application kernels can vary significantly. The techniques of data re-organization and folding (extensively used in software to map operations of the same type to a single resource in a time-multiplexed fashion) are presented in the context of the design of customized streaming video scratchpad memories. These techniques transform one scratchpad organization into another in order to best satisfy the kernel's requirements and the physical design constraints. Three instances of streaming scratchpads suiting different kernel types are designed. Based on RTL and netlist level simulations, their differences in performance, power dissipation and silicon area are stated.

KEY WORDS

streaming scratchpad memory, data reorganization, folding, motion estimation

1 Introduction

Due to the constant demand for quality, the complexity of both emerging and existing video applications is increasing. The algorithms are becoming hungrier for processing power, memory capacity and bandwidth, which in turn has increased the complexity of a typical System-on-Chip (SoC) for streaming video applications. This trend is especially visible in the SoCs used in modern television sets for enhancing the quality of the television picture [1].
To increase the quality of the picture, high quality de-noising, picture-rate up-conversion and de-interlacing algorithms are applied. These algorithms repeatedly reference the same pixels, which creates bottlenecks for the SoC communication infrastructure (point-to-point, bus, network-on-chip) in terms of guaranteed bandwidth and latency, and increases the power dissipation. The key to a high-performance, low-power SoC for streaming video is the organization of the memory subsystem. (Multi-level) buffering is the proven way to achieve the requested performance and to reduce both the bandwidth requirements of the background memory and the power dissipation of the whole SoC. In this paper, we address the problem of scratchpad memory design in the scope of streaming video applications.

Figure 1. Typical kernels present in video applications, panels (a) to (c).

In video applications, we can find several types of kernel operations, as depicted in figure 1. Fig. 1a shows the case where the kernel operates on a horizontal pixel line to produce one output pixel (marked as a black square). In fig. 1b, the kernel operates on a set of vertical pixels. In fig. 1c, the kernel operates on a circular footprint of pixels surrounding the central pixel in order to generate one output pixel. These kernels can also be extended to cover pixel data from different temporal instances. For instance, in fig. 1c, the white pixels can be from time instance 't+1' while the black pixels can be from time instance 't'. In its most generic form, a kernel can operate on O_t spatial pixel data (from time instance 't') in addition to P_{t-m}, ..., Q_{t+n} temporal pixel data (from time instances 't-m', ..., 't+n' respectively). The diversity in kernel types makes it difficult to provide a single scratchpad organization that suits every kernel well; hence the need for different scratchpad organizations to suit the requirements of different kernel types.
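The three footprint shapes described above can be sketched as sets of pixel offsets around the output pixel. This is a minimal illustration only: the function names, the footprint sizes, the border clamping and the averaging operation are assumptions made for the example and are not taken from the paper. It shows that a horizontal kernel reads from a single pixel line while vertical and circular kernels read from several, which is one reason a single scratchpad organization is hard to fit to all of them.

```python
# Hypothetical sketch: kernel footprints as lists of (dx, dy) offsets
# relative to the output pixel. Sizes and the radius are illustrative
# assumptions, not values taken from the paper.

def horizontal_footprint(half_width):
    """Fig. 1a style: pixels on the same line as the output pixel."""
    return [(dx, 0) for dx in range(-half_width, half_width + 1)]

def vertical_footprint(half_height):
    """Fig. 1b style: pixels in the same column as the output pixel."""
    return [(0, dy) for dy in range(-half_height, half_height + 1)]

def circular_footprint(radius):
    """Fig. 1c style: pixels within a given radius of the output pixel."""
    return [(dx, dy)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)
            if dx * dx + dy * dy <= radius * radius]

def lines_touched(footprint):
    """Number of distinct pixel lines (rows) the footprint reads."""
    return len({dy for _, dy in footprint})

def apply_kernel(image, x, y, footprint):
    """Average the footprint pixels around (x, y).

    Averaging and border clamping are placeholder choices for this
    sketch; a real kernel would apply its own operation.
    """
    height, width = len(image), len(image[0])
    total = 0
    for dx, dy in footprint:
        px = min(max(x + dx, 0), width - 1)   # clamp column to the image
        py = min(max(y + dy, 0), height - 1)  # clamp row to the image
        total += image[py][px]
    return total / len(footprint)
```

Comparing `lines_touched` for the three footprints gives a rough feel for how many pixel lines a local store would have to hold at once for each kernel type.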
Streaming scratchpads (also referred to as buffers in the literature) have been extensively studied for standardized processors [2, 3]. However, to the best of the authors' knowledge, there is no prior work (except that of the authors [4]) for customized processors [5, 6]. One such customized processor is depicted in figure 2. The processor contains standard functional units (ALU, ACU, etc.) and application specific functional units (sum-of-absolute-differences (SAD) and bi-linear interpolation (BI)). These application specific functional units can, for instance, process 16 pixels (128 bits) in parallel. Hence, the scratchpad memory units need to deliver 16 pixels per read access. In this paper, we focus on the design challenges of three scratchpads in the context of video applications.

The remainder of this paper is organized as follows: Section 2 presents concepts of data re-organization