IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY

Memory-Efficient High-Speed Convolution-Based Generic Structure for Multilevel 2-D DWT

Basant Kumar Mohanty, Senior Member, IEEE, and Pramod Kumar Meher, Senior Member, IEEE

Abstract—In this paper, we propose a design strategy for deriving memory-efficient architectures for multilevel 2-D DWT. Using the proposed design scheme, we derive a convolution-based generic architecture for the computation of 3-level 2-D DWT based on Daubechies as well as biorthogonal filters. The proposed structure does not involve a frame buffer. It involves line buffers of size 3(K - 2)M/4, which is independent of the throughput rate, where K is the order of the Daubechies/biorthogonal wavelet filter and M is the image height. This is a major advantage when the structure is implemented for higher throughput. The structure has regular data flow, a small cycle period T_M, and 100% hardware utilization efficiency. According to theoretical estimates, for an image size of 512 × 512 and the Daub-4 filter, the proposed structure requires 152 more multipliers and 114 more adders, but involves 82412 fewer memory words and takes 10.5 times less time to compute the 3-level 2-D DWT than the best of the existing convolution-based folded structures. Similarly, compared with the best of the existing lifting-based folded structures, the proposed structure using the 9/7 filter for the same image size involves 93 more multipliers and 166 more adders, but uses 85317 fewer memory words and requires 2.625 times less computation time. It involves 90 (nearly 47.6%) more multipliers and 118 (nearly 40.1%) more adders, but requires 2723 fewer memory words than the recently proposed parallel structure and performs the computation in nearly half the time of the other.
In spite of having more arithmetic components than the lifting-based structures, the proposed structure offers significant savings in area and power over the others due to a substantial reduction in memory size and a smaller clock period. ASIC synthesis results show that the proposed structure using Daub-4 involves 1.7 times less area-delay product (ADP) and consumes 1.21 times less energy per image (EPI) than the corresponding best available convolution-based structure. It involves 2.6 times less ADP and consumes 1.48 times less EPI than the parallel lifting-based structure.¹

Index Terms—Systolic array, VLSI, lifting, discrete wavelet transform (DWT), two-dimensional (2-D) DWT.

Manuscript submitted November 05, 2011; revised January 23, 2012, March 16, 2012, and April 23, 2012. This paper was recommended by Associate Editor Tian-Sheuan Chang. Copyright (c) 2012 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.

B. K. Mohanty is with the Department of Electronics and Communication Engineering, Jaypee University of Engineering and Technology, Raghogarh, Guna, Madhya Pradesh, India-473226 (email: bk.mohanti@juet.ac.in).

P. K. Meher is with the Department of Embedded Systems, Institute for Infocomm Research, 1 Fusionopolis Way, Singapore-138632 (email: pkmeher@i2r.a-star.edu.sg, URL: http://www1.i2r.a-star.edu.sg/pkmeher/).

¹In the case of the existing structures, the power consumed by frame-buffer accesses is not accounted for.

I. INTRODUCTION

Two-dimensional (2-D) discrete wavelet transform (DWT) is widely used in image and video compression [1]. The input image is required to be decomposed into a multilevel DWT to achieve a higher compression ratio. The multilevel 2-D DWT, on the other hand, being highly computation-intensive and memory-intensive, is implemented in VLSI systems to meet the temporal requirements of real-time applications.
Due to its ever-increasing use in high-data-rate communication and storage through portable and hand-held devices, VLSI implementation of the 2-D DWT is subject to a set of incompatible constraints, e.g., silicon area and power consumption along with the minimum processing speed required for real-time computation. Several architectures have therefore been suggested in the last few years for constraint-driven VLSI implementation of the 2-D DWT.

The multilevel 2-D DWT can be implemented by the recursive pyramid algorithm (RPA) [2]. However, the hardware utilization efficiency (HUE) of the RPA-based structure is always less than 100%, and it involves complex control circuits. To overcome this problem, Wu et al. [3] have suggested a folded scheme, where the multilevel DWT computation is performed level-by-level using one filtering unit and one external buffer. Unlike RPA-based designs, the folded design involves simple control circuitry and has 100% HUE.

In general, the folded structure consists of a pair of 1-D DWT modules (a row processor and a column processor) and memory/storage components. The memory component consists of a frame memory, a transposition memory, and a temporal memory [6]. The frame memory is required to store the low-low subband for level-by-level computation of the multilevel 2-D DWT. The transposition memory stores the intermediate values resulting from the row processing, while the temporal memory is used by the column processor to store partial results. The frame memory may be either on-chip or external, while the other two are on-chip memories. The transposition-memory size depends mainly on the type of data-access scheme adopted to feed the input data, while the temporal-memory size depends on the number of intermediate data registers required by the 1-D module to store the partial results. In general, the sizes of the transposition memory and temporal memory are some multiple of the width of the input image, while the size of the frame memory is of the order of the image size.
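To give a feel for the disparity between the frame memory and the on-chip buffers, the following back-of-the-envelope sketch evaluates the two sizes for the running example of this paper (512 × 512 image, Daub-4 filter). The LL-subband frame-memory size MN/4 follows from the level-by-level folded scheme described above, and the line-buffer formula 3(K - 2)M/4 is the one quoted in the abstract; the function names are ours, for illustration only.

```python
# Rough memory-size comparison for a folded multilevel 2-D DWT
# structure.  Function names are illustrative, not from any
# reference design in the literature.

def frame_memory_words(m, n):
    """Frame memory stores the low-low (LL) subband of one level,
    i.e. one quarter of an m x n image, for level-by-level reuse."""
    return (m * n) // 4

def line_buffer_words(k, m):
    """Line-buffer size 3(K - 2)M/4 of the proposed structure,
    where K is the filter order and M the image height."""
    return 3 * (k - 2) * m // 4

M = N = 512   # image height and width
K = 4         # Daub-4 filter order

print(frame_memory_words(M, N))   # 65536 words for the LL frame
print(line_buffer_words(K, M))    # 768 words of line buffer
```

The two orders of magnitude between these figures illustrate why eliminating the frame buffer, rather than shrinking the arithmetic units, dominates the overall area saving.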
On the other hand, the complexity of each 1-D module depends on the size of the wavelet filter and on the computation scheme (convolution-based or lifting-based) used to implement the filter. Since the size of the image is usually very large compared with the size of the filter, the complexity of the memory component forms the major part of the overall complexity of the 2-D structure. Cheng et al. [7] have suggested a parallel data-access scheme to reduce the size of the transposition memory. Folded structures based on this scheme require 4N memory words for the transposition and temporal memories [7], [10]. The memory requirement of [7] is the lowest among all the lifting-based 2-D DWT structures. Subsequently, Meher et al. [8] have proposed a similar parallel data-access scheme for convolution-based
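For readers less familiar with the computation being mapped to hardware, the following plain-Python sketch shows the convolution-based, level-by-level (folded-style) multilevel 2-D DWT in behavioral form: separable row/column filtering at each level, with only the LL subband carried forward. It assumes periodic boundary extension and the orthogonal Daub-4 filter pair; all helper names are ours, and the code models the algorithm only, not the paper's architecture.

```python
# Behavioral sketch of convolution-based, level-by-level 2-D DWT
# with the orthogonal Daub-4 filter and periodic boundary extension.
# Illustrates the algorithm only, not the hardware structure.
import math

S3 = math.sqrt(3.0)
R2 = math.sqrt(2.0)
# Daub-4 low-pass analysis filter (sums to sqrt(2), unit energy)
H = [(1 + S3) / (4 * R2), (3 + S3) / (4 * R2),
     (3 - S3) / (4 * R2), (1 - S3) / (4 * R2)]
# Quadrature-mirror high-pass filter: g[n] = (-1)^n * h[K-1-n]
G = [((-1) ** n) * H[len(H) - 1 - n] for n in range(len(H))]

def analyze_1d(x):
    """One level of 1-D analysis: periodic convolution, downsample by 2."""
    n = len(x)
    lo = [sum(H[t] * x[(2 * k + t) % n] for t in range(len(H)))
          for k in range(n // 2)]
    hi = [sum(G[t] * x[(2 * k + t) % n] for t in range(len(G)))
          for k in range(n // 2)]
    return lo, hi

def analyze_2d(img):
    """One 2-D level: filter rows, then columns, giving LL, LH, HL, HH."""
    rows = [analyze_1d(r) for r in img]
    L = [lo for lo, _ in rows]      # low-pass halves of the rows
    Hb = [hi for _, hi in rows]     # high-pass halves of the rows
    def cols(block):
        a, d = [], []
        for j in range(len(block[0])):
            lo, hi = analyze_1d([row[j] for row in block])
            a.append(lo)
            d.append(hi)
        # transpose column results back to row-major subbands
        return [list(t) for t in zip(*a)], [list(t) for t in zip(*d)]
    LL, LH = cols(L)
    HL, HH = cols(Hb)
    return LL, LH, HL, HH

def dwt2_multilevel(img, levels):
    """Folded-style recursion: only the LL subband is decomposed further."""
    out, ll = [], img
    for _ in range(levels):
        ll, lh, hl, hh = analyze_2d(ll)
        out.append((lh, hl, hh))
    out.append(ll)                  # final-level LL subband
    return out
```

Because Daub-4 is orthonormal, this decomposition preserves the signal energy across all subbands, which is a convenient sanity check for any implementation of the recursion.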