IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 1, NO. 2, JUNE 1999

Low Power Memory Storage and Transfer Organization for the MPEG-4 Full Pel Motion Estimation on a Multimedia Processor

Erik Brockmeyer, Lode Nachtergaele, Francky V. M. Catthoor, Member, IEEE, Jan Bormans, Member, IEEE, and Hugo J. De Man, Fellow, IEEE

Abstract—Data transfers and storage are crucial cost factors in multimedia systems. Systematic methodologies are needed to obtain dramatic reductions in terms of power, area, and cycle count. Upcoming multimedia processing applications will require high memory bandwidth. In this paper, we estimate that a software reference implementation of an MPEG-4 video encoder typically requires five Gtransfers/s to main memory for a simple profile at level L2. This shows a clear need for optimization and for the use of intermediate memory stages. By applying our ACROPOLIS methodology, developed mainly to relieve this data access bottleneck, we have arrived at an implementation that needs a factor of 65 fewer background accesses. In addition, we also show that we can heavily reduce the memory transfers, without sacrificing speed (even gaining about 10% on cache misses and cycles on a DEC Alpha), by aggressive source code transformations.

I. INTRODUCTION

Next-generation multimedia systems impose heavy demands on the data transfer and storage subsystem [27], [35]. To communicate and hold the massive amounts of data that represent media, fast busses and large memories with high access rates from/to the processors are needed. Efficient implementation of the complex media algorithms requires a global analysis of the critical sections and code transformations to eliminate, or at least alleviate, the impact of these bottlenecks. The recent MPEG-4 standard [33] is a key enabler for multimedia applications. It involves complex data-dominated algorithms.
A hardware or even an embedded software realization of such a (de)coder has to be power efficient in order to reduce the size of the chip package (where it is embedded) or of the battery (if used in a mobile application). It is well known by now that any future complex chip realization has to take power reduction into account [35]. Our previous research clearly shows the dominant power contribution of the data transfer and storage of multidimensional (M-D) array signals and other complex data types in data-dominated designs [6], [27] such as MPEG-4. In this paper we have exploited this feature to achieve large savings in the system power of a crucial part of MPEG-4, without sacrificing performance or system latency. The results support our claim that data transfer and storage exploration (DTSE) and optimization for multimedia algorithms have to be performed aggressively before the algorithms are realized in hardware and/or embedded software.

The MPEG-4 and multimedia (TriMedia) processor context and the related work are explained in Sections II, III, and IV. In the next two sections, the memory power model used and the MPEG-4 profiling data are discussed. Sections VII and VIII explain two major steps of our methodology, namely the global loop transformations and the data reuse step, using the motion estimation kernel as an example. The main topic, covered in Section IX, is the transformation of the motion estimation code within a group of VOP's to reduce the memory power. This section also includes an estimate of the number of accesses and a comparison with measured results.

Manuscript received September 9, 1998; revised January 6, 1999. The associate editor coordinating the review of this paper and approving it for publication was Prof. Jan-Ming Ho. The authors are with the Katholieke Universiteit Leuven, Leuven, 3001 Belgium (e-mail: brockmey@imec.be). Publisher Item Identifier S 1520-9210(99)04097-3.
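To make the data reuse step mentioned above concrete, the following is a simplified sketch (not the paper's actual code; the function name, buffer sizes, and the search range R are our own illustrative choices). The idea is that each reference pixel of the previous VOP is fetched from the large frame memory only once per macroblock, by copying the search window into a small local buffer; all candidate evaluations then read from that buffer instead of main memory.

```c
/* Illustrative data-reuse sketch (hypothetical sizes, not the paper's code):
 * copy the (2R+MB) x (2R+MB) search window once into a local buffer so that
 * the many candidate-block reads hit local memory, not the frame memory. */
#include <string.h>

#define MB  16            /* macroblock size (16x16 pixels) */
#define R   8             /* search range in pixels, hypothetical */
#define WIN (2 * R + MB)  /* side of the search window buffer */

/* Copy the search window around MB position (mbx, mby) from the previous
 * VOP into win[]; the caller must ensure the window lies inside the frame. */
void fill_window(const unsigned char *prev_vop, int stride,
                 int mbx, int mby, unsigned char win[WIN][WIN])
{
    for (int y = 0; y < WIN; y++)
        memcpy(win[y], prev_vop + (mby - R + y) * stride + (mbx - R), WIN);
}
```

Since a given reference pixel can belong to up to 16×16 candidate blocks in a full search, reading it once from frame memory instead of once per candidate is where the large reduction in background accesses comes from.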
All the work so far assumes a software-controlled cache. In Section XI, the gain of our methodology is analyzed for a hardware-controlled cache, in the case of an H.263 decoder.

II. MPEG-4 MOTION ESTIMATION CONTEXT

The purpose of the MPEG-4 Video Verification Model (VM) is to describe completely defined encoding and decoding "common core" algorithms and to allow the conduction of experiments under controlled conditions in a common environment [33]. The exploration in this paper has been performed on the MoMuSys Video VM Version 7.0 [30].

The MPEG-4 standard enables an efficient coded representation of the audio and video data that can be "content based," with the aim to use and present the data in a highly flexible way. Every object is coded on its own: the decoder can scale, place, and extract the objects from different sources. The size, position, and content of an object can vary during a sequence. The video object planes (VOP's), containing coded video sequences and shape information, are divided into macroblocks (MB's: groups of 16×16 pixels).

To exploit the temporal redundancy of a sequence, an H.263-like motion estimation is used. The arrows in Fig. 1 represent all motion estimation steps for one group of VOP's. The original source uses the well-known baseline full search motion estimation to generate the motion vectors (MV's). The motion vectors allow the information to be coded more efficiently by exploiting the temporal redundancy and constructing the next VOP out of the previous VOP. All VOP's are divided into MB's of 16×16 pixels, and the MB's are processed sequentially in the motion estimation.

1520-9210/99$10.00 © 1999 IEEE
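The baseline full search described above can be sketched as follows. This is a minimal illustration, not the VM source: the function names, the frame layout, and the search range parameter are our own assumptions; the sum of absolute differences (SAD) is the usual matching criterion for full pel motion estimation.

```c
/* Illustrative full search motion estimation for one 16x16 macroblock
 * (a sketch of the baseline technique, not the MoMuSys VM code). */
#include <limits.h>
#include <stdlib.h>

#define MB 16  /* macroblock size: 16x16 pixels */

/* Sum of absolute differences between the current MB and one candidate. */
static int sad_16x16(const unsigned char *cur, const unsigned char *ref,
                     int stride)
{
    int sad = 0;
    for (int y = 0; y < MB; y++)
        for (int x = 0; x < MB; x++)
            sad += abs((int)cur[y * stride + x] - (int)ref[y * stride + x]);
    return sad;
}

/* Exhaustive search over a square window of +/-range pixels around the MB
 * at (mbx, mby). Returns the best SAD; *mvx/*mvy receive the motion vector. */
int full_search(const unsigned char *cur, const unsigned char *ref,
                int stride, int w, int h, int mbx, int mby, int range,
                int *mvx, int *mvy)
{
    int best = INT_MAX;
    for (int dy = -range; dy <= range; dy++) {
        for (int dx = -range; dx <= range; dx++) {
            int rx = mbx + dx, ry = mby + dy;
            if (rx < 0 || ry < 0 || rx + MB > w || ry + MB > h)
                continue;  /* candidate block falls outside the VOP */
            int sad = sad_16x16(cur + mby * stride + mbx,
                                ref + ry * stride + rx, stride);
            if (sad < best) {
                best = sad;
                *mvx = dx;
                *mvy = dy;
            }
        }
    }
    return best;
}
```

Every candidate in the window reads all 256 pixels of both blocks, which is exactly why this kernel dominates the memory access count and is the natural target for the transfer and storage optimizations of the following sections.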