A Comparison of 2-D Discrete Wavelet Transform Computation Schedules on FPGAs

Maria Angelopoulou 1, Konstantinos Masselos 1, Peter Cheung 1, Yiannis Andreopoulos 2

1 Department of Electrical and Electronic Engineering, Imperial College London, Exhibition Road, London SW7 2BT, UK
{m.angelopoulou, k.masselos, p.cheung}@imperial.ac.uk

2 Department of Electrical Engineering, University of California Los Angeles, 54-147 Eng. IV Building, 420 Westwood Plaza, Los Angeles, CA 90095-1594, USA
yandreop@ee.ucla.edu

Abstract: Three major computation schedules have been proposed for the 2-D Discrete Wavelet Transform (DWT): the row-column, the line-based and the block-based. In this work, lifting-based designs of these schedules are implemented on FPGA-based platforms to execute the forward 2-D DWT, and their comparison is presented. Our implementations are optimized in terms of throughput and memory requirements, in accordance with the specifications of each of the three computation schedules and the lifting decomposition. All implementations are parameterized with respect to the image size and the number of decomposition levels. Experimental results show that the suitability of each implementation for a particular application depends on the given specifications for throughput and hardware cost.

I. INTRODUCTION

The two-dimensional Discrete Wavelet Transform (DWT) is a key operation in image processing, and is the kernel of both the JPEG-2000 still image compression standard [1] and the MPEG-4 still texture decoding standard [2]. The 2-D DWT is carried out by applying the 1-D DWT in both the horizontal and the vertical direction of the image. As shown in Fig. 1, each unit that executes the 1-D DWT produces two sets of coefficients: a low-frequency and a high-frequency set. The outputs of a horizontal filtering stage are vertically filtered to produce the 2-D subbands LL, LH, HL and HH.
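The separable decomposition described above can be sketched in a few lines of software. This is only an illustrative model, not the hardware design of this paper: the function names are hypothetical, the 1-D unit is a simple average/difference (Haar-like) placeholder rather than the filter bank used in the implementations, and the subband naming (first letter for the horizontal stage, second letter for the vertical stage) is an assumption.

```python
def haar_1d(signal):
    """Placeholder 1-D DWT unit (assumption): returns (low, high) for an even-length list."""
    low = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return low, high


def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


def dwt2d_level(image, dwt_1d=haar_1d):
    """One decomposition level: horizontal filtering, then vertical filtering."""
    # Horizontal stage: filter every row, giving the L and H half-images.
    rows = [dwt_1d(row) for row in image]
    L = [low for low, _ in rows]
    H = [high for _, high in rows]

    def vertical(half):
        # Vertical stage: filter every column of a half-image.
        cols = [dwt_1d(col) for col in transpose(half)]
        low = transpose([lo for lo, _ in cols])
        high = transpose([hi for _, hi in cols])
        return low, high

    LL, LH = vertical(L)
    HL, HH = vertical(H)
    return LL, LH, HL, HH


def dwt2d(image, levels):
    """Level-by-level decomposition: only LL is passed on to the next level."""
    detail = []  # (LH, HL, HH) of every level, kept for reconstruction
    ll = image
    for _ in range(levels):
        ll, lh, hl, hh = dwt2d_level(ll)
        detail.append((lh, hl, hh))
    return ll, detail
```

For an M x M input with M divisible by 2^L, calling `dwt2d(image, L)` returns the final LL subband together with the stored detail subbands of every level.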
All LH, HL and HH coefficients are stored so that they can later contribute to the reconstruction of the original image from the LL set. The LL coefficients either become the input of the horizontal filtering stage of the next level, if there is one, or are stored as well, if the current level is the last one.

The traditional convolution-based 1-D DWT [3] imposes high computational complexity. The lifting scheme [4], [5] reduces this complexity by factorizing the polyphase matrix of the DWT into elementary matrices.

For the implementation of the 2-D DWT, several computation schedules have been proposed. In practical designs, the most commonly used computation schedules are the row-column (RC) [3], the line-based (LB) [6] and the block-based (BB) [7]. The simplest of these is RC, which adopts the level-by-level logic of Fig. 1. However, such an approach necessitates the use of large memory blocks, distant from the computational units, as the only source of the filter's inputs. Contrary to RC, both LB and BB involve an on-chip memory structure that operates as a cache for the original image, minimizing the accesses to the large memory blocks. Thus, memory utilization and memory-access locality are improved. The main difference between LB and BB concerns the way the original image is traversed. Specifically, LB processes non-overlapping groups of lines, whereas BB operates on non-overlapping blocks of the image.

Fig. 1. The 2-D DWT decomposition. (h.f.: horizontal filtering stage; v.f.: vertical filtering stage. Each unit implements the forward 1-D DWT. At level j = 0, 1, 2, ..., the horizontal stage splits LL j into L j and H j, and the vertical stage produces LL j+1, LH j+1, HL j+1 and HH j+1; the input IN is LL 0.)

In [8], [9] and [10], 2-D DWT computation schedules are compared on a theoretical basis. In [11] and [12], they are compared on programmable architectures and on a VLIW DSP, respectively.
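As a concrete illustration of the lifting factorization mentioned above, the following sketch shows a 1-D transform built from one predict step and one update step, each corresponding to an elementary matrix. It assumes the reversible integer 5/3 filter bank with mirrored borders; this section does not state which filter bank the implementations use, so the choice, like the function names, is purely illustrative.

```python
def lifting_53_forward(x):
    """Forward 1-D 5/3 integer lifting on an even-length list of ints (illustrative)."""
    n = len(x) // 2
    even, odd = x[0::2], x[1::2]
    # Predict step: each odd sample minus the floor-average of its even
    # neighbours (the last even sample is mirrored at the right border).
    high = [odd[i] - (even[i] + even[min(i + 1, n - 1)]) // 2 for i in range(n)]
    # Update step: each even sample plus a rounded quarter of the neighbouring
    # high-pass values (the first high value is mirrored at the left border).
    low = [even[i] + (high[max(i - 1, 0)] + high[i] + 2) // 4 for i in range(n)]
    return low, high


def lifting_53_inverse(low, high):
    """Inverse transform: undo the update step, then undo the predict step."""
    n = len(low)
    even = [low[i] - (high[max(i - 1, 0)] + high[i] + 2) // 4 for i in range(n)]
    odd = [high[i] + (even[i] + even[min(i + 1, n - 1)]) // 2 for i in range(n)]
    x = [0] * (2 * n)
    x[0::2], x[1::2] = even, odd  # re-interleave even and odd samples
    return x
```

Because every lifting step is undone by the same expression with the opposite sign, `lifting_53_inverse(*lifting_53_forward(x))` returns `x` exactly, and each step needs only a few additions and shifts per sample.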
Even though the above comparisons are particularly enlightening, none of them is based upon hardware implementations. Thus, they do not exploit the implementation efficiency and the data-processing parallelism that hardware can offer. In addition, the vast majority of these comparisons focus on convolution-based realizations and do not consider lifting.

FPT 2006 181 0-7803-9729-0/06/$20.00 2006 IEEE

Contribution of this paper: The three major 2-D DWT lifting-based computation schedules are implemented on FPGA-based platforms and compared in terms of performance and area. The computation schedules are compared for different image sizes (M × M) and numbers of decomposition levels (L) of the transform. To the best of our knowledge no comparisons of detailed hardware implementations of