Performance Comparison of SIMD Implementations of the Discrete Wavelet Transform

Asadollah Shahbahrami, Ben Juurlink, Stamatis Vassiliadis
Computer Engineering Laboratory
Faculty of Electrical Engineering, Mathematics, and Computer Science
Delft University of Technology, The Netherlands
E-mail: {shahbahrami,benj,stamatis}@ce.et.tudelft.nl

Abstract

This paper focuses on SIMD implementations of the 2D discrete wavelet transform (DWT). The transforms considered are Daubechies’ real-to-real method of four coefficients (Daub-4) and the integer-to-integer (5, 3) lifting scheme. Daub-4 is implemented using SSE and the lifting scheme using MMX, and their performance is compared to C implementations on a Pentium 4 processor. The MMX implementation of the lifting scheme is up to 4.0x faster than the corresponding C program for a 1-level 2D DWT, while the SSE implementation of Daub-4 is up to 2.6x faster than the C version. It is shown that for some image sizes, the performance is significantly hampered by the so-called 64K aliasing problem, which occurs in the Pentium 4 when two data blocks are accessed that are a multiple of 64K apart. It is also shown that for the (5, 3) lifting scheme, a 12-bit word size is sufficient for a 5-level decomposition of the 2D DWT for images of up to 10 bits per pixel.

Keywords: Discrete Wavelet Transform, lifting scheme, SIMD extensions.

1 Introduction

The wavelet transform is mainly used for image and video compression. Standards such as MPEG-4 and JPEG2000 [13] are based on the 2D discrete wavelet transform (DWT). The DWT has traditionally been implemented by convolution methods such as finite impulse response (FIR) filters. These implementations require both a large number of operations and a large amount of memory, making them unsuitable for either high-speed or low-power implementations. One way to reduce the execution time of the DWT is by using special-purpose hardware.
Programmable processors, however, are preferable because they are more flexible and allow different transforms, various filter bank lengths, and various transform levels. Furthermore, multimedia SIMD extensions such as MMX [12] and SSE [14] can be used to reduce the execution time of the DWT.

In this paper the performance of two wavelet transforms, conventional real-to-real filtering and the integer-to-integer lifting scheme, is evaluated. Both methods are implemented using programmable SIMD architectures. Specifically, we present an MMX implementation of the lifting scheme and compare its performance to an SSE implementation of the convolution method. The lifting scheme is considered with the goal of providing a fast and efficient implementation of the DWT that reduces the execution time of JPEG2000.

The (5, 3) lifting scheme is considered for various reasons. First, the (5, 3) transform has low computational complexity and performs reasonably well for lossy as well as lossless compression compared to other filters [1]. Second, the (5, 3) transform is included in Part 1 of the JPEG2000 standard [13]. Third, it is possible to implement the (5, 3) filter without using multiplication operations (i.e., using only addition, subtraction, and shift operations). Finally, the (5, 3) filter has only one lifting step. Transforms with fewer lifting steps tend to perform better than transforms with more lifting steps in terms of speed as well as accuracy [1]. The convolution method considered in this paper is Daubechies’ transform with four coefficients (Daub-4). This transform has been considered in many papers [2, 9].

This paper is organized as follows. Section 2 briefly describes the wavelet transform and explains the SSE implementation of the 2D DWT using Daub-4. In Section 3, the MMX implementation of the (5, 3) lifting scheme is discussed. In Section 4 the performance of both SIMD implementations and their C counterparts is evaluated and analyzed.
In Section 5 we discuss the limitations of MMX and SSE that restrict the performance improvements that can be obtained for the 2D DWT. Related work is described in Section 6 and conclusions are drawn in Section 7.
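
The observation that the (5, 3) filter needs only additions, subtractions, and shifts can be illustrated with a minimal scalar C sketch of one 1D transform level. This is not the implementation evaluated in this paper. The function names lift53_forward and lift53_inverse are hypothetical, the input length is assumed to be even, and the symmetric boundary extension and rounding offsets are assumptions based on the reversible (5, 3) filter as it is commonly stated for JPEG2000 Part 1. The sketch also assumes that the compiler implements >> on negative signed integers as an arithmetic shift, which the C standard leaves implementation-defined.

```c
#include <stddef.h>

/* One level of a 1D forward (5, 3) integer lifting transform (sketch).
 * x    : n input samples, n even and n >= 2 (assumption)
 * low  : n/2 approximation (low-pass) outputs
 * high : n/2 detail (high-pass) outputs
 * Only +, -, and >> are applied to the sample values. */
void lift53_forward(const int *x, size_t n, int *low, int *high)
{
    size_t half = n / 2;

    /* Predict: each odd sample minus the floor-average of its even neighbors.
     * At the right edge the missing neighbor x[n] is mirrored to x[n-2]. */
    for (size_t i = 0; i < half; i++) {
        int left  = x[2 * i];
        int right = (2 * i + 2 < n) ? x[2 * i + 2] : x[2 * i];
        high[i] = x[2 * i + 1] - ((left + right) >> 1);
    }

    /* Update: each even sample plus a rounded quarter of the two nearest
     * details. At the left edge the missing detail high[-1] is mirrored
     * to high[0]. */
    for (size_t i = 0; i < half; i++) {
        int prev = (i > 0) ? high[i - 1] : high[0];
        low[i] = x[2 * i] + ((prev + high[i] + 2) >> 2);
    }
}

/* Inverse transform: undo the update, then undo the prediction. */
void lift53_inverse(const int *low, const int *high, size_t n, int *x)
{
    size_t half = n / 2;

    /* Recover the even samples first; they are needed by the next loop. */
    for (size_t i = 0; i < half; i++) {
        int prev = (i > 0) ? high[i - 1] : high[0];
        x[2 * i] = low[i] - ((prev + high[i] + 2) >> 2);
    }

    /* Recover the odd samples from the details and the even neighbors. */
    for (size_t i = 0; i < half; i++) {
        int left  = x[2 * i];
        int right = (2 * i + 2 < n) ? x[2 * i + 2] : x[2 * i];
        x[2 * i + 1] = high[i] + ((left + right) >> 1);
    }
}
```

Because the inverse subtracts exactly the integer quantities that the forward transform added, the input is reconstructed exactly regardless of how the shifts round, which is what makes the scheme integer-to-integer. A 2D level would apply such a 1D pass along the rows and then along the columns.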