A High-Throughput and Memory Efficient 2-D
Discrete Wavelet Transform Hardware Architecture for
JPEG2000 Standard
G. Dimitroulakos¹, M. D. Galanis², A. Milidonis³, and C.E. Goutis⁴
VLSI Design Lab., Electrical and Computer Engineering Department, University of Patras, Patras, Greece
{dhmhgre¹, mgalanis², milidon³, goutis⁴@ee.upatras.gr}
Abstract— The design and implementation of an efficient
hardware architecture in terms of speed and memory
requirements for computing the tile-based Two-Dimensional
Forward Discrete Wavelet Transform for the JPEG2000 still
image compression standard is described in this paper. This
architecture is derived from a well-established architecture
template for calculating the Two-Dimensional Forward
Discrete Wavelet Transform. The filters of that template are
replaced by our previously published throughput-optimized
ones. A scheduling algorithm has been developed that
matches the special features of our filtering units. The
performance improvements are due to the throughput-
optimized filters. Also, due to the developed scheduling
algorithm, reduced memory requirements are achieved when
compared with previously published architectures.
I. INTRODUCTION
The Discrete Wavelet Transform (DWT) has been introduced as
an effective and flexible methodology for subband decomposition
of signals [1]. This transform exhibits good algorithmic
characteristics, which are the reasons for its wide usage in
contemporary multimedia compression standards, such as the
JPEG2000 [2] and MPEG-4 [3].
A large variety of hardware architectures for implementing the
Two-Dimensional separable Forward (2D-DWT) and Inverse DWT
(2D-IDWT) have been presented [4], [5], [6], [7], [8], [9]. These
architectures are composed of filters for performing the One-
Dimensional (1D) DWT and memory units for storing the results at
each stage of the transformation. The need to optimize the
filters’ architecture in terms of performance arises because
multimedia applications, of which the DWT is a part, are
characterized by high throughput requirements. The minimization
of the memory size can be achieved by setting up a proper sequence
of the computations, called time scheduling. The goal of scheduling
is to maximize the utilization of the filtering units, and to minimize
the memory buffering between the computation stages.
In principle, the 1D-DWT architectures can be extended to
architectures for computing the separable 2D-DWT. This is due to
the fact that the separable 2D-DWT can be computed by 1D-DWT
filtering on rows of an input image followed by 1D-DWT filtering
on columns. In the Tile Based (TB) 2D-DWT the input image is
optionally decomposed into a number of non-overlapping
rectangular blocks, called tiles, and the separable 2D-DWT is
applied inside each tile independently.
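The row-then-column decomposition and the tiling described above can be sketched in a few lines of Python. This is a minimal illustration, not the architecture of this paper: it assumes the unnormalized Haar filter pair purely for brevity (JPEG2000 itself specifies longer filters), it assumes even tile dimensions that divide the image exactly, and the function names `haar_dwt1d`, `dwt2d_separable`, and `split_into_tiles` are hypothetical. Subband naming conventions also vary; here the first letter refers to the row pass and the second to the column pass.

```python
def haar_dwt1d(samples):
    """One level of 1D analysis with the unnormalized Haar pair (an assumption).

    Returns (low, high), each half the input length; the length must be even.
    """
    low = [(samples[i] + samples[i + 1]) / 2 for i in range(0, len(samples), 2)]
    high = [(samples[i] - samples[i + 1]) / 2 for i in range(0, len(samples), 2)]
    return low, high


def _filter_columns(block):
    """Apply haar_dwt1d down every column of a row-major block."""
    low_cols, high_cols = [], []
    for column in zip(*block):
        low, high = haar_dwt1d(list(column))
        low_cols.append(low)
        high_cols.append(high)
    # Transpose the per-column results back to row-major order.
    return [list(r) for r in zip(*low_cols)], [list(r) for r in zip(*high_cols)]


def dwt2d_separable(tile):
    """One level of separable 2D-DWT: 1D filtering on rows, then on columns."""
    row_low, row_high = [], []
    for row in tile:
        low, high = haar_dwt1d(row)
        row_low.append(low)
        row_high.append(high)
    ll, lh = _filter_columns(row_low)    # low-pass rows, then low/high columns
    hl, hh = _filter_columns(row_high)   # high-pass rows, then low/high columns
    return ll, lh, hl, hh


def split_into_tiles(image, tile_h, tile_w):
    """Cut a row-major image into non-overlapping tiles, in raster order."""
    tiles = []
    for top in range(0, len(image), tile_h):
        for left in range(0, len(image[0]), tile_w):
            tiles.append([row[left:left + tile_w] for row in image[top:top + tile_h]])
    return tiles


if __name__ == "__main__":
    image = [[r * 4 + c for c in range(4)] for r in range(4)]
    # Each tile is transformed independently of the others.
    for tile in split_into_tiles(image, 2, 2):
        print(dwt2d_separable(tile))
```

Because each tile is processed on its own, no samples from a neighbouring tile are needed at the tile borders, which is what makes the tile-based transform attractive for bounding buffer sizes.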
In this paper, an optimized architecture in terms of performance
and memory requirements for computing the TB 2D-DWT in a
JPEG2000 encoder is presented. The proposed architecture is based
on a well-known architecture template, presented in [8], where the
four conventional filters [5], [7] have been replaced by the four
Throughput-Optimized (TO) ones presented in [10]. An efficient
scheduling algorithm is proposed that is based on the line-based
algorithm for computing the separable 2D-DWT [9] and is suited to
the filters’ special characteristics. This scheduling algorithm
minimizes the memory required between the levels of decomposition
during the 2D-DWT computation. It also yields improved throughput,
which is due to the use of our filters in the DWT architecture.
The rest of the paper is organized as follows: Section II presents
the related work, while Section III gives the basic theoretical
background. Section IV illustrates the 2D-DWT encoder architecture
and presents the scheduling algorithms. The memory requirements
and performance of the proposed 2D-DWT encoder architecture
are compared with those of existing encoders in
Section V. Finally, Section VI concludes the paper.
II. RELATED WORK
Chakrabarti and Vishwanath [4] have proposed an extensible
architecture for the encoder based on the non-separable 2D-DWT.
This architecture consists of two parallel filtering units of size K²
and a storage unit of size ≈ N·K. A parallel filter of size M consists
of M multipliers and a tree of adders to add the M products.
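The structure of a parallel filter of size M can be mimicked in software as M independent multiplications followed by a pairwise reduction. The sketch below is only a behavioural illustration under our own assumptions: the names `parallel_filter` and `adder_tree` are hypothetical, the tree is assumed to be balanced, and the reported depth is simply the number of adder levels, not a timing figure from any of the cited designs.

```python
def adder_tree(values):
    """Sum values pairwise, level by level, as a balanced adder tree would.

    Returns (total, depth), where depth is the number of adder levels.
    """
    if not values:
        raise ValueError("adder_tree needs at least one input")
    level = list(values)
    depth = 0
    while len(level) > 1:
        # Each adder on this level combines two neighbouring operands.
        next_level = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            # An unpaired operand is passed through to the next level.
            next_level.append(level[-1])
        level = next_level
        depth += 1
    return level[0], depth


def parallel_filter(coefficients, window):
    """One output sample of a size-M filter: M products, then an adder tree."""
    if len(coefficients) != len(window):
        raise ValueError("window length must equal the filter size M")
    # In hardware the M multipliers operate concurrently.
    products = [c * x for c, x in zip(coefficients, window)]
    return adder_tree(products)


if __name__ == "__main__":
    print(parallel_filter([1, 2, 3, 4], [1, 1, 1, 1]))
```

For M inputs the balanced tree has ⌈log₂ M⌉ adder levels, which is why such filters trade area (M multipliers) for a short combinational path.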
Vishwanath et al. [6] proposed an architecture for separable 2D-
DWT, which consists of two systolic arrays of size K, two parallel
filters of size K, and a storage unit of size ≈ N·(2·K+J). A
disadvantage of this architecture is that two rows of the input image
are supplied to the two systolic arrays every two cycles; as a
result, an additional data converter is required to convert the
raster-scan input (one per cycle) into an output of two per two cycles.
Chakrabarti and Mumford [8] proposed an architecture for the
analysis (synthesis) filters based on the 2D-DWT. Two scheduling
algorithms for computing the forward (inverse) 2D-DWT were also
described. The goal was to minimize the memory requirements and
to keep the data-flow as regular as possible. Zervas et al. [7]
compare the three main hardware architectures for computing the
0-7803-8834-8/05/$20.00 ©2005 IEEE.