DIAGONAL VECTORISATION OF 2-D WAVELET LIFTING

David Barina
Faculty of Information Technology, Brno University of Technology, Czech Republic

Pavel Zemcik
Faculty of Information Technology, Brno University of Technology, Czech Republic

ABSTRACT

With the widespread use of the discrete wavelet transform in image processing, the need for its efficient implementation is becoming increasingly important. This work presents a novel SIMD vectorisation of the 2-D discrete wavelet transform through a lifting scheme. For all of the tested platforms, this vectorisation is significantly faster than other known methods, as shown in the results of the experiments.

Index Terms: Discrete wavelet transforms, Image processing

1. INTRODUCTION

The discrete wavelet transform (DWT) [1] is a mathematical tool suitable for decomposing a discrete signal into lowpass and highpass frequency components. Such a decomposition can even be performed at several scales. It is often used as the basis of sophisticated compression algorithms.

Considering the number of arithmetic operations, the lifting scheme [2] is currently the most efficient way of computing the discrete wavelet transform. This paper focuses on the CDF (Cohen-Daubechies-Feauveau) 9/7 wavelet [3], which is often used for image compression (e.g., in the JPEG 2000 standard). Responses of this wavelet can be computed by a convolution with two FIR filters, one with 7 and the other with 9 coefficients. The transform employing such a wavelet can be computed with four successive lifting steps as shown in [2]. The resulting coefficients are then divided into two disjoint groups: approximation and detail coefficients, or L and H subbands. The simple approach to evaluating the lifting data flow graph directly follows the lifting steps. This approach suffers from several reads and writes of intermediate results. However, more efficient ways of lifting evaluation [4] [5] exist.
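The simple evaluation that directly follows the four lifting steps can be sketched in scalar C as below. This is a minimal illustration, not the implementation evaluated in this paper. It makes three assumptions: the lifting constants are the commonly cited CDF 9/7 values (they are not given in this text), borders use symmetric extension, and the final scaling follows one of several conventions in use. The names lift_step, cdf97_forward and cdf97_inverse are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Commonly cited CDF 9/7 lifting constants (assumption: not from this paper). */
static const float ALPHA = -1.586134342f;
static const float BETA  = -0.052980118f;
static const float GAMMA =  0.882911075f;
static const float DELTA =  0.443506852f;
static const float ZETA  =  1.149604398f; /* scaling; conventions vary */

/* One lifting step over an interleaved signal: every sample at index
 * first, first+2, ... receives w * (left neighbour + right neighbour).
 * Borders use symmetric extension: x[-1] = x[1], x[n] = x[n-2]. */
static void lift_step(float *x, size_t n, size_t first, float w)
{
    for (size_t i = first; i < n; i += 2) {
        float left  = (i == 0)     ? x[1]     : x[i - 1];
        float right = (i + 1 == n) ? x[n - 2] : x[i + 1];
        x[i] += w * (left + right);
    }
}

/* Forward transform in place; n must be even and at least 2.
 * Afterwards, even indices hold L and odd indices hold H coefficients.
 * Each step is a full pass over the signal, hence the repeated reads
 * and writes of intermediate results. */
void cdf97_forward(float *x, size_t n)
{
    assert(n >= 2 && n % 2 == 0);
    lift_step(x, n, 1, ALPHA); /* first predict: odd samples  */
    lift_step(x, n, 0, BETA);  /* first update: even samples  */
    lift_step(x, n, 1, GAMMA); /* second predict              */
    lift_step(x, n, 0, DELTA); /* second update               */
    for (size_t i = 0; i < n; i += 2) {
        x[i]     *= ZETA;
        x[i + 1] *= 1.0f / ZETA;
    }
}

/* Inverse transform: undo the scaling, then run the lifting steps in
 * reverse order with negated weights. */
void cdf97_inverse(float *x, size_t n)
{
    assert(n >= 2 && n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        x[i]     *= 1.0f / ZETA;
        x[i + 1] *= ZETA;
    }
    lift_step(x, n, 0, -DELTA);
    lift_step(x, n, 1, -GAMMA);
    lift_step(x, n, 0, -BETA);
    lift_step(x, n, 1, -ALPHA);
}
```

Calling cdf97_forward on an array of even length leaves the L coefficients at even indices and the H coefficients at odd indices; cdf97_inverse restores the input up to floating-point rounding. Because every step only adds a weighted sum of untouched neighbours, the inverse needs no separate filter design, which is a general property of lifting rather than anything specific to this sketch.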
In the case of the two-dimensional transform, the DWT can be realized using the separable decomposition scheme [6]. In this scheme, the coefficients are evaluated by successive horizontal and vertical 1-D filtering, resulting in four disjoint groups (LL, HL, LH and HH subbands). A naive algorithm for 2-D DWT computation directly follows the horizontal and vertical filtering loops. Unfortunately, this approach is encumbered with several accesses to intermediate results. The horizontal and vertical loops can be fused into a single one, yielding the single-loop approach [7].

In present personal computers, a general-purpose microprocessor with a SIMD (single instruction, multiple data) instruction set is often found. For example, in the x86 architecture, the appropriate instruction set is SSE (Streaming SIMD Extensions). This 4-fold SIMD set fits exactly the CDF 9/7 lifting data flow graph when using the single-precision floating-point format.

In this paper, the recently published diagonal vectorisation of wavelet lifting [5] is incorporated into the known single-loop approach. This new implementation is compared to the original one using the vertical vectorisation as well as to the naive approach with separated horizontal and vertical loops. On the tested platforms, this new combination is consistently and significantly faster than the original approach employing vertical vectorisation.

This paper is focused on present computers with the x86 architecture. All the methods presented in this paper are evaluated using ordinary PCs with x86 CPUs. An Intel Core2 Quad Q9000 running at 2.0 GHz was used. This CPU has 32 kiB of level 1 data cache and 3 MiB of level 2 shared cache (two cores share one cache unit). The results were verified on a system with an AMD Opteron 2380 running at 2.5 GHz. This CPU has 64 kiB of level 1 data cache, 512 kiB of level 2 cache per core and 6 MiB of level 3 shared cache (all four cores share one unit).
Another set of control measurements was done on an Intel Core2 Duo E7600 at 3.06 GHz and on an AMD Athlon 64 X2 4000+ at 2.1 GHz. These are referred to as alternative platforms. Due to limited space, the details are not shown here, with the exception of a summarizing table. All algorithms below were implemented in the C language using SSE compiler intrinsics.¹ In all cases, 64-bit code compiled using GCC 4.8.1 with the -O3 flag was used.

¹ The code can be downloaded from http://www.fit.vutbr.cz/research/prod/?id=211.

Copyright 2014 IEEE. Published in the IEEE 2014 International Conference on Image Processing (ICIP 2014), scheduled for October 27-30, 2014, in Paris, France. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: + Intl. 908-562-3966.

The rest of the paper is organised as follows. The Related Work section discusses the state of the art, especially the lifting scheme, vectorisations and the 2-D single-loop approach. The Single-Loop Approach section focuses on this 2-D approach in more