An Investigation on FPGA based SAD Hardware Implementations

Stephan Wong, Bastiaan Stougie, and Sorin Cotofana
Computer Engineering Laboratory, Electrical Engineering Department, Delft University of Technology
Stephan@Dutepp0.ET.TUDelft.NL

Abstract— In this paper, we argue that the utilization of field-programmable gate array (FPGA) structures can improve the performance of embedded systems based on programmable processor cores. Furthermore, in multimedia processing it is well known that the sum-of-absolute-differences (SAD) operation is the most time-consuming operation when implemented in software running on such programmable processor cores. This is mainly due to the sequential characteristic of such an implementation. Therefore, in this paper we investigate several hardware implementations of the SAD operation and map the most promising one onto an FPGA. Our investigation shows that an adder tree based approach yields the best results in terms of speed and area requirements; we implemented this approach in high-level VHDL code. Due to the limited number of I/O pins of current commercially available FPGA chips, we opted to implement the SAD over multiple chips by utilizing a single design. The design was functionally verified with the MAX+plus II 10.1 Baseline software package from Altera Corp. and then synthesized with the LeonardoSpectrum software package from Exemplar Logic Inc. Preliminary results show that the design can be clocked at 380 MHz. This result translates into a faster-than-real-time full search in motion estimation for the main profile/main level of the MPEG-2 standard.

Keywords—sum of absolute differences, field-programmable gate array, hardware synthesis.

I. INTRODUCTION

In video coding, similarities between video frames can be exploited to achieve higher compression ratios.
However, moving objects within a video scene diminish the compression efficiency of the straightforward approach that only considers pels¹ located at the same position in the video frames. In order to achieve higher compression efficiency, motion estimation was introduced in an attempt to accurately capture such movements. In the MPEG-1/2 multimedia standards, it is performed for every macroblock, i.e., an array of 16 × 16 pels, in the frame to be encoded by finding its ‘best’ match in a reference frame. The most commonly used metric to evaluate the match is the “sum of absolute differences” (SAD), which adds up the absolute differences between corresponding elements in the macroblocks. The SAD operation is very time-consuming due to the complex nature of the absolute value operation and the subsequent multitude of additions. In [15], a parallel hardware implementation was proposed to speed up the SAD computation process. This paper describes, among others, this parallel hardware implementation of the SAD operation and focuses on the implementation of such designs in field-programmable gate arrays (FPGAs). The reasons to utilize FPGAs are discussed in the following.

¹ Pel stands for picture element and represents the smallest color data unit of a picture or video frame.

Traditionally, the design of embedded multimedia processors was very similar to microcontroller design. This meant that for each targeted set of multimedia applications, an embedded multimedia processor needed to be designed in specialized hardware (commonly referred to as Application Specific Integrated Circuits (ASICs)). In the early nineties, we witnessed a shift in the embedded processor design approach, fuelled by the need for shorter time-to-market. In embedded processor design, this resulted in the utilization of programmable processor cores augmented with specialized hardware units implemented in ASICs.
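As a point of reference for the SAD metric defined earlier in this introduction, the following is a minimal software sketch in Python. It is not the VHDL design investigated in this paper. The function names (`sad`, `adder_tree_sum`, `sad_adder_tree`) are hypothetical, and pels are assumed to be plain integers. The first function mirrors the sequential software formulation; the second only models the order of additions in an adder tree, where each level of pairwise additions could in principle be performed in parallel in hardware.

```python
MACROBLOCK_SIZE = 16  # MPEG-1/2 macroblocks are 16 x 16 pels


def sad(block_a, block_b):
    """Sequential SAD: accumulate |a - b| over corresponding pels."""
    total = 0
    for row_a, row_b in zip(block_a, block_b):
        for pel_a, pel_b in zip(row_a, row_b):
            total += abs(pel_a - pel_b)
    return total


def adder_tree_sum(values):
    """Sum values pairwise, level by level, in the order an adder tree would."""
    level = list(values)
    if not level:
        return 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad odd-sized levels with a neutral element
        # All additions within one level are independent of each other.
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def sad_adder_tree(block_a, block_b):
    """SAD where all absolute differences are formed first, then tree-summed."""
    diffs = [
        abs(pel_a - pel_b)
        for row_a, row_b in zip(block_a, block_b)
        for pel_a, pel_b in zip(row_a, row_b)
    ]
    return adder_tree_sum(diffs)


# Illustrative data only: two uniform macroblocks differing by 3 per pel.
current = [[10] * MACROBLOCK_SIZE for _ in range(MACROBLOCK_SIZE)]
reference = [[13] * MACROBLOCK_SIZE for _ in range(MACROBLOCK_SIZE)]
print(sad(current, reference), sad_adder_tree(current, reference))
```

For a 16 × 16 macroblock, the tree formulation needs 8 levels of additions for the 256 absolute differences, whereas the sequential loop performs 256 dependent accumulations.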
Consequently, time-critical tasks were implemented in specialized hardware units while other tasks were implemented in software to be run on the programmable processor core [13]. This approach allowed a programmable processor core to be reused for different sets of applications, so that only the augmented units needed to be (re-)designed for specific application areas.

Currently, we are witnessing a new trend that is again quickly reshaping embedded processor design. Instead of implementing the time-critical tasks in ASICs, these tasks are to be implemented in field-programmable gate array (FPGA) structures or comparable technologies [4], [14], [16], [6]. The reasons for and the benefits of such an approach include the following:

• Increased flexibility: The functionality of the embedded processor can be quickly changed without requiring