56 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 7, NO. 1, MARCH 1999

An Efficient VLSI Architecture for 2-D Wavelet Image Coding with Novel Image Scan

Gauthier Lafruit, Francky Catthoor, Member, IEEE, Jan P. H. Cornelis, Member, IEEE, and Hugo J. De Man, Fellow, IEEE

Abstract: A folded very large scale integration (VLSI) architecture is presented for the implementation of the two-dimensional discrete wavelet transform, without constraints on the choice of the wavelet-filter bank. The proposed architecture is dedicated to flexible block-oriented image processing, such as adaptive vector quantization used in wavelet image coding. We show that reading the image along a two-dimensional (2-D) pseudo-fractal scan creates a very modular and regular data flow and, therefore, considerably reduces the folding complexity and memory requirements for VLSI implementation. This leads to significant area savings for on-chip storage (up to a factor of two) and reduces the power consumption. Furthermore, data scheduling and memory management remain very simple. The end result is an efficient VLSI implementation with a reduced area cost compared to conventional approaches that read the input data line by line.

I. INTRODUCTION

Data compression for image transmission is concerned with reducing the number of bits needed to adequately transmit an image over limited-bandwidth networks. This data reduction is achieved by removing the spatial redundancy in a still image and the temporal redundancy between successive images in a video sequence. In recent years, wavelet-based coding techniques [1] have emerged and challenged the more classical discrete cosine transform (DCT)-based coding algorithms.
A large number of different wavelet compression algorithms are described in the literature, in which compression is performed by scalar quantization [2], [3], vector quantization [3]–[9], and related methods, e.g., zero-tree coding [10] and pyramid vector quantization [11], [12]. Each technique has its merits and application domain. This paper is devoted to the vector quantization coding technique and describes a completely new approach for the algorithmic optimization of the combination of a multirate algorithm [i.e., the two-dimensional discrete wavelet transform (2-D-DWT) of Fig. 1] with single-rate digital signal processor (DSP) modules (i.e., vector quantization and motion vector estimation), in order to satisfy the area/performance/power constraints in the application-specific integrated circuit (ASIC) design [13] of a wavelet image coder. As the background memory is typically a bottleneck in video and image processing, both for storage and access bandwidth [14], the memory and communication access organization should be optimized and decided on before the matching data-path organization is derived. We show that the classical line-by-line image-scanning sequence introduces a high memory cost for the block-based operators used in wavelet image coders. The ASIC memory bottleneck will be tackled by algorithmic innovations, guided by knowledge of the underlying hardware and circuit properties, leading to a simple and efficient architecture for the wavelet image coder.

Manuscript received June 28, 1996; revised August 26, 1998.
G. Lafruit is with IMEC, B-3001 Heverlee, Belgium, and is also with the Vrije Universiteit Brussel, ETRO, B-1050 Brussels, Belgium.
F. Catthoor and H. J. De Man are with IMEC, B-3001 Heverlee, Belgium.
J. P. H. Cornelis is with the Vrije Universiteit Brussel, ETRO, B-1050 Brussels, Belgium.
Publisher Item Identifier S 1063-8210(99)00692-7.
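To make the multirate character of the 2-D-DWT concrete, the following minimal sketch filters and subsamples an image along rows and then columns, so that each level works on a quarter of the samples of the previous one. It is an illustration under stated assumptions, not the architecture of this paper: the two-tap averaging/differencing kernels stand in for the filter bank of Table I (not reproduced here), the periodic border extension is an arbitrary choice, and all function names are hypothetical.

```python
# Illustrative stand-in kernels (assumption): NOT the filters of Table I.
HAAR_LOW = [0.5, 0.5]
HAAR_HIGH = [0.5, -0.5]


def analyze_1d(signal, lowpass, highpass):
    """Filter a 1-D signal with both kernels and keep every second output."""
    n = len(signal)

    def filter_and_subsample(kernel):
        out = []
        for start in range(0, n, 2):  # step of 2 = subsampling by two
            acc = 0.0
            for k, coeff in enumerate(kernel):
                # Periodic border extension (assumption for this sketch).
                acc += coeff * signal[(start + k) % n]
            out.append(acc)
        return out

    return filter_and_subsample(lowpass), filter_and_subsample(highpass)


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def analyze_rows(matrix, lowpass, highpass):
    """Apply analyze_1d to every row; return the low and high half-images."""
    low_rows, high_rows = [], []
    for row in matrix:
        low, high = analyze_1d(row, lowpass, highpass)
        low_rows.append(low)
        high_rows.append(high)
    return low_rows, high_rows


def dwt2d_level(image, lowpass, highpass):
    """One separable 2-D level: rows first, then columns via transposition."""
    row_low, row_high = analyze_rows(image, lowpass, highpass)
    ll_t, lh_t = analyze_rows(transpose(row_low), lowpass, highpass)
    hl_t, hh_t = analyze_rows(transpose(row_high), lowpass, highpass)
    return transpose(ll_t), transpose(lh_t), transpose(hl_t), transpose(hh_t)


def dwt2d(image, lowpass, highpass, levels):
    """Iterate on the low-low subimage; each level has 1/4 of the samples."""
    pyramid = []
    current = image
    for _ in range(levels):
        ll, lh, hl, hh = dwt2d_level(current, lowpass, highpass)
        pyramid.append((lh, hl, hh))
        current = ll
    return current, pyramid
```

Because every level consumes its input at a different rate, the order in which input samples arrive determines how many intermediate values must be buffered before a block-based operator can start, which is the memory issue raised above.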
We show that by subdividing the data in block units and reading them along a 2-D pseudo-fractal scan curve, a close match is obtained between the wavelet decomposition stage and the subsequent block-oriented vector quantization and motion vector estimation processing, which lowers the size of the intermediate frame buffer memories. Ideally, data can be read from a single input frame memory, and intermediate data values are stored on-chip to reduce off-chip memory accesses. This lowers the power consumption and simplifies the implementation for a target processing rate. We show that for a practical implementation of a three-tap high-pass/nine-tap low-pass wavelet compression scheme (see Table I) dedicated to adaptive vector quantization, all intermediate results can be stored on-chip, whereas this would be less feasible in classical wavelet decomposition implementations. Furthermore, the size of the data blocks created in the intermediate wavelet levels remains constant. The use of a fixed block size for the data units in the different levels of the wavelet pyramid ensures flexibility, i.e., the vector quantization block size can still be chosen in a range between one and the input data-block size. Finally, we describe some optimizations for the data-path synthesis of the so-called wavelet processor (WP) module, which is in charge of the filtering calculations and subsampling operations. We use a procedure similar to [15] and [16] to avoid hardwired variable multipliers, by realizing the constant multiplications as shift-add expansions with the canonical-signed-digit (CSD) coding [17]–[19]. Our main contribution is, however, the heavily optimized hardware sharing between the low-pass and high-pass filter operations.
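The shift-add realization of constant multiplications mentioned above can be sketched as follows. This is a minimal software illustration of generic CSD recoding, assuming integer (fixed-point) coefficients; the function names `to_csd` and `csd_multiply` are hypothetical, and the example coefficients are not taken from Table I.

```python
def to_csd(value):
    """Return the CSD digits of an integer, least significant digit first.

    Each digit is -1, 0, or +1, and no two adjacent digits are nonzero.
    """
    digits = []
    while value != 0:
        if value % 2 == 0:
            digits.append(0)
        else:
            # value % 4 == 1 -> digit +1; value % 4 == 3 -> digit -1.
            digit = 2 - (value % 4)
            digits.append(digit)
            value -= digit
        value //= 2
    return digits


def csd_multiply(x, coefficient):
    """Multiply x by a constant using only shifts, additions, subtractions."""
    acc = 0
    for shift, digit in enumerate(to_csd(coefficient)):
        if digit == 1:
            acc += x << shift
        elif digit == -1:
            acc -= x << shift
    return acc
```

For instance, `to_csd(7)` yields `[-1, 0, 0, 1]`, i.e., 7 = 8 - 1, so a multiplication by 7 needs one shift and one subtraction instead of the two additions implied by the plain binary form 4 + 2 + 1. In hardware, each nonzero digit corresponds to one adder or subtractor, which is why minimizing the number of nonzero digits matters.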
This sharing-oriented assignment of operations to data-path resources differs significantly from the folded and bit-serial architectures proposed in the literature [15], [16], [20]–[26] and results in a very low overhead in terms of connections and internal data-path registers. To realize this, we also use the advanced architecture synthesis tools offered within the Cathedral-3 environment [27], [28]. The synergy between the resulting overall optimized 2-D-DWT VLSI chip and an adaptive vector quantization algorithm