Performance Comparison of SIMD Implementations of the Discrete Wavelet Transform

Asadollah Shahbahrami, Ben Juurlink, Stamatis Vassiliadis
Computer Engineering Laboratory
Faculty of Electrical Engineering, Mathematics, and Computer Science
Delft University of Technology, The Netherlands
E-mail: {shahbahrami,benj,stamatis}@ce.et.tudelft.nl

Abstract

This paper focuses on SIMD implementations of the 2D discrete wavelet transform (DWT). The transforms considered are Daubechies’ real-to-real method of four coefficients (Daub-4) and the integer-to-integer (5, 3) lifting scheme. Daub-4 is implemented using SSE and the lifting scheme using MMX, and their performance is compared to C implementations on a Pentium 4 processor. The MMX implementation of the lifting scheme is up to 4.0x faster than the corresponding C program for a 1-level 2D DWT, while the SSE implementation of Daub-4 is up to 2.6x faster than the C version. It is shown that for some image sizes, the performance is significantly hampered by the so-called 64K aliasing problem, which occurs in the Pentium 4 when two data blocks are accessed that are a multiple of 64K apart. It is also shown that for the (5, 3) lifting scheme, a 12-bit word size is sufficient for a 5-level decomposition of the 2D DWT for images of up to 10 bits per pixel.

Keywords: Discrete Wavelet Transform, lifting scheme, SIMD extensions.

1 Introduction

The wavelet transform is mainly used for image and video compression. Standards such as MPEG-4 and JPEG2000 [13] are based on the 2D discrete wavelet transform (DWT). The DWT has traditionally been implemented by convolution methods such as finite impulse response (FIR) filters. These implementations require both a large number of operations and a large amount of memory, making them unsuitable for either high-speed or low-power implementations. One way to reduce the execution time of the DWT is by using special-purpose hardware.
Programmable processors, however, are preferable because they are more flexible and allow different transforms, various filter bank lengths, and various transform levels. Furthermore, multimedia SIMD extensions such as MMX [12] and SSE [14] can be used to reduce the execution time of the DWT.

In this paper the performance of two wavelet transforms, conventional real-to-real filtering and the integer-to-integer lifting scheme, is evaluated. Both methods are implemented using programmable SIMD architectures. Hence, we present an MMX implementation of the lifting scheme and compare its performance to an SSE implementation of the convolution method. The lifting scheme is considered with the goal of providing a fast and efficient implementation of the DWT to reduce the execution time of JPEG2000.

The (5, 3) lifting scheme is considered for various reasons. First, the (5, 3) transform has low computational complexity and performs reasonably well for lossy as well as lossless compression compared to other filters [1]. Second, the (5, 3) transform is included in Part 1 of the JPEG2000 standard [13]. Third, it is possible to implement the (5, 3) filter without using multiplication operations (i.e., using only addition, subtraction, and shift operations). Finally, the (5, 3) filter has only one lifting step. Transforms with fewer lifting steps tend to perform better than transforms with more lifting steps in terms of speed as well as accuracy [1]. The convolution method considered in this paper is Daubechies’ transform with four coefficients (Daub-4). This transform has been considered in many papers [2, 9].

This paper is organized as follows. Section 2 briefly describes the wavelet transform and explains the SSE implementation of the 2D DWT using Daub-4. In Section 3, the MMX implementation of the (5, 3) lifting scheme is discussed. In Section 4 the performance of both SIMD implementations and their C counterparts is evaluated and analyzed.
In Section 5 we discuss the limitations of MMX and SSE that restrict the performance improvements that can be obtained for the 2D DWT. Related work is described in Section 6 and conclusions are drawn in Section 7.
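
To make concrete the remark that the (5, 3) filter needs only additions, subtractions, and shifts, the following is a minimal scalar C sketch of one level of a forward 1D (5, 3) integer lifting transform. It is an illustration under stated assumptions, not the MMX implementation discussed in Section 3. The function name `lift53_forward`, the even-length restriction, and the symmetric border extension are assumptions made for this example. The predict and update formulas follow the reversible (5, 3) transform as it is commonly stated for JPEG2000.

```c
#include <assert.h>
#include <stddef.h>

/* One level of a forward 1D (5, 3) integer lifting transform (illustrative).
 * x:    n input samples, n even and n >= 2 (assumption of this sketch)
 * low:  n/2 low-pass (approximation) outputs
 * high: n/2 high-pass (detail) outputs
 * Assumes >> on a negative int is an arithmetic shift, which is
 * implementation-defined in C but holds on mainstream compilers. */
void lift53_forward(const int *x, size_t n, int *low, int *high)
{
    assert(n >= 2 && n % 2 == 0);
    size_t half = n / 2;

    /* Predict: high[i] = odd sample - floor((left even + right even) / 2). */
    for (size_t i = 0; i < half; i++) {
        int left = x[2 * i];
        /* Symmetric extension at the right border: x[n] mirrors x[n - 2]. */
        int right = (2 * i + 2 < n) ? x[2 * i + 2] : x[n - 2];
        high[i] = x[2 * i + 1] - ((left + right) >> 1);
    }

    /* Update: low[i] = even sample + floor((high[i-1] + high[i] + 2) / 4). */
    for (size_t i = 0; i < half; i++) {
        /* Symmetric extension at the left border: high[-1] mirrors high[0]. */
        int prev = (i > 0) ? high[i - 1] : high[0];
        low[i] = x[2 * i] + ((prev + high[i] + 2) >> 2);
    }
}
```

For the ramp 0, 1, ..., 7, this sketch yields detail coefficients 0, 0, 0, 1 and approximation coefficients 0, 2, 4, 6. The only nonzero detail comes from the mirrored right border. Because every operation is an integer add, subtract, or shift on neighbouring samples, the loops map naturally onto packed 16-bit SIMD arithmetic.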