A Comparison of Two SIMD Implementations of the 2D Discrete Wavelet Transform

Asadollah Shahbahrami 1, 2 Ben Juurlink 1

1 Computer Engineering Laboratory, Faculty of Electrical Engineering, Mathematics, and Computer Science, Delft University of Technology, The Netherlands. Phone: +31 15 2787362. Fax: +31 15 2784898. E-mail: {shahbahrami,benj,stamatis}@ce.et.tudelft.nl
2 Department of Electrical Engineering, Faculty of Engineering, The University of Guilan, Rasht, Iran.

Abstract

There are generally two algorithms to traverse an image to implement the 2D Discrete Wavelet Transform (DWT), namely the Row-Column Wavelet Transform (RCWT) and the Line-Based Wavelet Transform (LBWT). In the RCWT algorithm, the 2D DWT is divided into two 1D DWTs: horizontal and vertical filtering. The horizontal filtering processes the rows of the original image and stores the wavelet coefficients in an auxiliary matrix. Thereafter, the vertical filtering phase processes the columns of the auxiliary matrix and stores the results back in the original matrix. In the LBWT algorithm, the vertical filtering is started as soon as a sufficient number of rows, as determined by the filter length, has been horizontally processed. In this paper, we provide answers to the following questions: first, which implementation is easier to vectorize using SIMD instructions? Second, which SIMD implementation provides higher performance? Our initial results for Daubechies’ transform with four coefficients show that the SIMD implementation of the LBWT algorithm is more complicated than the SIMD implementation of the RCWT algorithm, but it is 1.60 times faster than the RCWT implementation for an image of size 4096 × 4096.

Keywords: Discrete Wavelet Transform, Multimedia Extensions, SIMD.

I. Introduction

JPEG2000 is a wavelet-based image compression standard. This standard has some important features compared to the Discrete Cosine Transform (DCT) block-based JPEG standard.
For example, the JPEG2000 standard provides superior performance at low bit rates, decomposes the image into a multiple-resolution representation, and supports region-of-interest coding [11]. The main reason the JPEG2000 standard provides these features is its use of the Discrete Wavelet Transform (DWT). However, the DWT is the most time-consuming function in the JPEG2000 standard and has higher computational requirements than the DCT. Our results, obtained by profiling the JasPer software toolkit [2], show that the 2D DWT consumes on average 46% of the encoding time for lossless compression. For lossy compression, the DWT requires as much as 68% of the total encoding time on average. Results presented by other researchers [1, 8] also show that the 2D DWT consumes a significant part of the total JPEG2000 encoding time. Consequently, improving the performance of the 2D DWT is important for increasing the performance of this multimedia compression standard.

This research was supported in part by the Netherlands Organisation for Scientific Research (NWO).

One way to improve the performance of the 2D DWT is to exploit its Data Level Parallelism (DLP) through vectorization. Vectorization identifies and extracts DLP so that a single instruction can operate on multiple data elements concurrently (Single Instruction, Multiple Data, SIMD). Recently, general-purpose processors have been enhanced with SIMD instructions; for example, the Pentium 4 includes the SSE instruction set [19].

There are generally two algorithms to traverse an image to implement the 2D DWT, namely the Row-Column Wavelet Transform (RCWT) and the Line-Based Wavelet Transform (LBWT) [5, 12]. In the RCWT approach, the 2D DWT is divided into two 1D DWTs, namely horizontal filtering and vertical filtering. The horizontal filtering processes all rows of the image, and is followed by the vertical filtering, which processes the columns.
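As a rough illustration of the RCWT traversal just described, the following scalar C sketch applies one decomposition level: rows of the image are filtered into an auxiliary matrix, and columns of the auxiliary matrix are then filtered back into the image. The function names, the periodic boundary handling, and the output layout (low-pass half followed by high-pass half) are assumptions made for this example and are not taken from the paper; the filter values are the standard Daubechies-4 coefficients. It is not a SIMD implementation.

```c
#include <assert.h>
#include <math.h>

/* Standard Daubechies-4 analysis filters: h is low-pass, g is high-pass. */
static void daub4_filters(float h[4], float g[4]) {
    const float s3 = sqrtf(3.0f);
    const float d = 4.0f * sqrtf(2.0f);
    h[0] = (1.0f + s3) / d;
    h[1] = (3.0f + s3) / d;
    h[2] = (3.0f - s3) / d;
    h[3] = (1.0f - s3) / d;
    g[0] = h[3];
    g[1] = -h[2];
    g[2] = h[1];
    g[3] = -h[0];
}

/* One 1D DWT level over n samples read with stride in_stride and written
 * with stride out_stride. Low-pass outputs go to the first half of the
 * output, high-pass outputs to the second half. Samples past the end wrap
 * around (periodic extension, an assumption of this sketch). */
static void dwt1d(const float *in, int in_stride,
                  float *out, int out_stride, int n) {
    float h[4], g[4];
    assert(n >= 4 && n % 2 == 0);
    daub4_filters(h, g);
    int half = n / 2;
    for (int i = 0; i < half; i++) {
        float lo = 0.0f, hi = 0.0f;
        for (int k = 0; k < 4; k++) {
            float x = in[((2 * i + k) % n) * in_stride];
            lo += h[k] * x;
            hi += g[k] * x;
        }
        out[i * out_stride] = lo;
        out[(i + half) * out_stride] = hi;
    }
}

/* RCWT, one level: horizontal filtering of img rows into aux, then
 * vertical filtering of aux columns back into img (row-major storage). */
static void rcwt2d(float *img, float *aux, int rows, int cols) {
    for (int r = 0; r < rows; r++)
        dwt1d(img + r * cols, 1, aux + r * cols, 1, cols);
    for (int c = 0; c < cols; c++)
        dwt1d(aux + c, cols, img + c, cols, rows);
}
```

Calling `rcwt2d(img, aux, rows, cols)` on a row-major image overwrites `img` with the four subbands. Note that the vertical pass walks each column with a stride of `cols` elements, so consecutive accesses are far apart in memory.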
The LBWT algorithm uses a single loop to process both rows and columns together.

In this paper, we provide answers to the following questions: first, which implementation is easier to vectorize using SIMD instructions? Second, which SIMD implementation provides higher performance? Our initial results for Daubechies’ transform with four coeffi-