A High-Throughput and Memory Efficient 2-D Discrete Wavelet Transform Hardware Architecture for JPEG2000 Standard

G. Dimitroulakos¹, M. D. Galanis², A. Milidonis³, and C. E. Goutis⁴
VLSI Design Lab., Electrical and Computer Engineering Department, University of Patras, Patras, Greece
{dhmhgre¹, mgalanis², milidon³, goutis⁴}@ee.upatras.gr

Abstract: This paper describes the design and implementation of a hardware architecture, efficient in terms of speed and memory requirements, for computing the tile-based Two-Dimensional Forward Discrete Wavelet Transform for the JPEG2000 still image compression standard. The architecture is derived from a well-established architecture template for calculating the Two-Dimensional Forward Discrete Wavelet Transform. The filters of that template are replaced by our previously published throughput-optimized ones. A scheduling algorithm has been developed that matches the special features of our filtering units. The performance improvements are due to the throughput-optimized filters. In addition, the developed scheduling algorithm reduces the memory requirements compared with previously published architectures.

I. INTRODUCTION

The Discrete Wavelet Transform (DWT) has been introduced as an effective and flexible methodology for subband decomposition of signals [1]. Its good algorithmic characteristics are the reason for its wide use in contemporary multimedia compression standards, such as JPEG2000 [2] and MPEG-4 [3]. A large variety of hardware architectures for implementing the Two-Dimensional separable Forward DWT (2D-DWT) and Inverse DWT (2D-IDWT) have been presented [4], [5], [6], [7], [8], [9]. These architectures are composed of filters for performing the One-Dimensional (1D) DWT and memory units for storing the results at each stage of the transformation.
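As a software illustration of what a single 1D-DWT filtering stage computes, the following minimal sketch performs one decomposition level: each output is a sum of tap-by-sample products, and only every second output is kept. It is not the throughput-optimized filter of [10]. It assumes the LeGall 5/3 analysis taps used by JPEG2000, evaluated in floating point without the standard's integer rounding, and whole-sample symmetric extension at the signal borders. The function names `reflect`, `fir_at`, and `dwt1d_level` are hypothetical and introduced only for this example.

```python
# LeGall 5/3 analysis taps (assumed here; floating point, no integer rounding).
LOW_TAPS = (-0.125, 0.25, 0.75, 0.25, -0.125)  # low-pass, centered on even samples
HIGH_TAPS = (-0.5, 1.0, -0.5)                  # high-pass, centered on odd samples


def reflect(i, n):
    """Map index i into [0, n) using whole-sample symmetric extension."""
    if n == 1:
        return 0
    period = 2 * (n - 1)
    i %= period
    return i if i < n else period - i


def fir_at(x, taps, center):
    """One filter output: len(taps) multiplications followed by a summation."""
    half = len(taps) // 2
    return sum(t * x[reflect(center + k - half, len(x))]
               for k, t in enumerate(taps))


def dwt1d_level(x):
    """One 1D-DWT level: low-pass at even positions, high-pass at odd positions."""
    n = len(x)
    low = [fir_at(x, LOW_TAPS, c) for c in range(0, n, 2)]
    high = [fir_at(x, HIGH_TAPS, c) for c in range(1, n, 2)]
    return low, high


if __name__ == "__main__":
    low, high = dwt1d_level([3.0] * 8)
    print(low, high)
```

For a constant input the high-pass outputs vanish and the low-pass outputs reproduce the constant, because the low-pass taps sum to one and the high-pass taps sum to zero. In a hardware filter, each call to `fir_at` would correspond to one set of multipliers and an adder tree.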
The need to optimize the filters’ architecture for performance arises because multimedia applications that include the DWT are characterized by high throughput requirements. The memory size can be minimized by setting up a proper sequence of the computations, called time scheduling. The goal of scheduling is to maximize the utilization of the filtering units and to minimize the memory buffering between the computation stages.

In principle, 1D-DWT architectures can be extended to architectures for computing the separable 2D-DWT, because the separable 2D-DWT can be computed by 1D-DWT filtering on the rows of an input image followed by 1D-DWT filtering on the columns. In the Tile-Based (TB) 2D-DWT, the input image is optionally decomposed into a number of non-overlapping rectangular blocks, called tiles, and the separable 2D-DWT is applied inside each tile independently.

In this paper, an architecture optimized in terms of performance and memory requirements for computing the TB 2D-DWT in a JPEG2000 encoder is presented. The proposed architecture is based on a well-known architecture template, presented in [8], in which the four conventional filters [5], [7] have been replaced by the four Throughput-Optimized (TO) ones presented in [10]. An efficient scheduling is proposed that is based on the line-based algorithm for computing the separable 2D-DWT [9] and is suited to the filters’ special characteristics. This scheduling algorithm minimizes the memory required between the levels of decomposition during the 2D-DWT computation. It also results in improved throughput, which is due to the use of our filters in the DWT architecture.

The rest of the paper is organized as follows: Section 2 presents the related work, while Section 3 gives the basic theoretical background. Section 4 illustrates the 2D-DWT encoder architecture and presents the scheduling algorithms.
The memory requirements and performance of the proposed 2D-DWT encoder architecture, and their comparison with existing encoders, are presented in Section 5. Finally, Section 6 concludes the paper.

II. RELATED WORK

Chakrabarti and Vishwanath [4] proposed an extensible architecture for the encoder based on the non-separable 2D-DWT. This architecture consists of two parallel filtering units of size K² and a storage unit of size ≈ N·K. A parallel filter of size M consists of M multipliers and a tree of adders that adds the M products.

Vishwanath et al. [6] proposed an architecture for the separable 2D-DWT, which consists of two systolic arrays of size K, two parallel filters of size K, and a storage unit of size ≈ N·(2·K+J). A disadvantage of this architecture is that two rows of the input image are supplied to the two systolic arrays every two cycles; as a result, an additional data converter is required to convert the raster-scan input (one per cycle) into a two-per-two-cycles output.

Chakrabarti and Mumford [8] proposed an architecture for the analysis (synthesis) filters based on the 2D-DWT. They also described two scheduling algorithms for computing the forward (inverse) 2D-DWT. The goal was to minimize the memory requirements and to keep the data flow as regular as possible.

0-7803-8834-8/05/$20.00 ©2005 IEEE

Zervas et al. [7] compare the three main hardware architectures for computing the