ACCELERATION AND IMPLEMENTATION OF JPEG2000 ENCODER ON TI DSP
PLATFORM
Chien-Chih Liu and Hsueh-Ming Hang
Electronics Engineering Department, National Chiao Tung University, Hsinchu, Taiwan, R.O.C.
ccliu.iic93g@nctu.edu.tw, hmhang@mail.nctu.edu.tw
This work was partially supported by National Science Council, Taiwan, R.O.C., under Grant NSC-94-2213-E-009-144.
ABSTRACT
JPEG2000 provides excellent compression performance and
fine granularity scalability but at the cost of high
computational complexity. We propose two speed-up
techniques and use the TI DSP optimization tools to
accelerate the Tier-1 module. We eliminate unnecessary
checking cycles by recording the NBC (Need-to-Be-Coded)
samples on a list. Furthermore, the sample index is
reordered to facilitate fast execution. In the DSP
implementation of the proposed methods, we use code
acceleration techniques, cache memory allocation, and TI
DSP compiler-level optimization tools. Even when the
original program is compiled with the same DSP
optimization tools and proper cache assignment, our fast
algorithm can still reduce the computation by 45%.
Index Terms— JPEG2000, DSP, algorithm acceleration
1. INTRODUCTION
In contrast to the discrete cosine transform (DCT) used in
the JPEG standard, the JPEG2000 standard [1] implements
an entirely new way of compressing images based on the
wavelet transform. It supports lossy and lossless
compression of single-component (gray-level) and multi-
component (color) images. The major operation blocks of
the JPEG2000 encoding system are shown in Figure 1. The
pre-processing includes the image tiling, DC-Level shifting,
and component transform. The component transform and
the discrete wavelet transform have both the irreversible
mode and the reversible mode used for lossy and lossless
coding, respectively. The entropy coding part of JPEG2000
adopts the EBCOT technique (Embedded Block Coding
with Optimized Truncation) [2]. It consists of two major
coding steps, Tier-1 and Tier-2. The Tier-1 part is an
embedded block coding scheme that consists of the context
formation (CF) and the arithmetic encoder (AE). The Tier-2
and rate-control part adopts the PCRD (Post-Compression
Rate-Distortion) optimization to truncate the embedded
bit-stream so that the overall distortion is minimized.
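The pre-processing steps named above can be illustrated with a minimal C sketch of the DC-level shift and the reversible (integer) component transform. The type and function names (rgb_t, ycc_t, floor_div4, dc_level_shift, rct_forward, rct_inverse) are hypothetical and are not taken from OpenJPEG; the sketch assumes three components of equal bit depth and is meant only to show why the integer mode is exactly invertible.

```c
#include <assert.h>

/* Hypothetical helper types for this sketch; not taken from OpenJPEG. */
typedef struct { int r, g, b; } rgb_t;
typedef struct { int y, cb, cr; } ycc_t;

/* Floor division by 4 that is also correct for negative inputs
 * (right-shifting a negative int is implementation-defined in C). */
static int floor_div4(int x)
{
    return (x >= 0) ? x / 4 : -((-x + 3) / 4);
}

/* DC-level shift: move an unsigned sample in [0, 2^bit_depth) to a
 * range centred on zero by subtracting 2^(bit_depth - 1). */
static int dc_level_shift(int sample, int bit_depth)
{
    assert(bit_depth > 0 && bit_depth < 31);
    return sample - (1 << (bit_depth - 1));
}

/* Forward reversible component transform (integer arithmetic only). */
static ycc_t rct_forward(rgb_t in)
{
    ycc_t out;
    out.y  = floor_div4(in.r + 2 * in.g + in.b);
    out.cb = in.b - in.g;
    out.cr = in.r - in.g;
    return out;
}

/* Inverse transform: recovers the input of rct_forward exactly,
 * because y = g + floor((cb + cr) / 4). */
static rgb_t rct_inverse(ycc_t in)
{
    rgb_t out;
    out.g = in.y - floor_div4(in.cb + in.cr);
    out.r = in.cr + out.g;
    out.b = in.cb + out.g;
    return out;
}
```

For an 8-bit image, each sample would first pass through dc_level_shift(sample, 8), and each level-shifted pixel would then go through rct_forward. The rounding in the luminance term is undone by the same floor operation in the inverse, which is what makes this mode suitable for lossless coding.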
[Block diagram: Input Image -> Pre-Processing (Tiling, DC-Level
shifting, Component Transform; real or integer mode) -> Forward
Discrete Wavelet Transform (real or integer mode) -> Uniform Scalar
Quantization -> Tier-1 -> Tier-2 and Rate-Control -> Coded Image]
Figure 1 JPEG2000 encoder architecture
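To make the reversible (integer) mode of the wavelet transform concrete, the following is a minimal C sketch of one level of the one-dimensional 5/3 lifting transform, which is the reversible filter defined by JPEG2000. The function names (floor_div, dwt53_forward, dwt53_inverse) are hypothetical, the signal length is assumed to be even, and the boundaries use symmetric extension; it is an illustration, not the OpenJPEG routine.

```c
#include <assert.h>
#include <stddef.h>

/* Floor division for a positive divisor; C's '/' truncates toward zero. */
static int floor_div(int a, int b)
{
    int q = a / b;
    return (a % b < 0) ? q - 1 : q;
}

/* One level of the forward 5/3 lifting transform on x[0..n-1].
 * Assumes n is even and n >= 2; low and high each hold n/2 values. */
static void dwt53_forward(const int *x, size_t n, int *low, int *high)
{
    size_t half = n / 2;
    assert(n >= 2 && n % 2 == 0);

    /* Predict step: high-pass (detail) coefficients. */
    for (size_t i = 0; i < half; i++) {
        /* Symmetric extension at the right edge: x[n] = x[n-2]. */
        int right = (2 * i + 2 < n) ? x[2 * i + 2] : x[n - 2];
        high[i] = x[2 * i + 1] - floor_div(x[2 * i] + right, 2);
    }
    /* Update step: low-pass (smooth) coefficients. */
    for (size_t i = 0; i < half; i++) {
        /* Symmetric extension at the left edge: high[-1] = high[0]. */
        int prev = (i > 0) ? high[i - 1] : high[0];
        low[i] = x[2 * i] + floor_div(prev + high[i] + 2, 4);
    }
}

/* Inverse transform: undoes the update step, then the predict step. */
static void dwt53_inverse(const int *low, const int *high, size_t n, int *x)
{
    size_t half = n / 2;
    assert(n >= 2 && n % 2 == 0);

    for (size_t i = 0; i < half; i++) {
        int prev = (i > 0) ? high[i - 1] : high[0];
        x[2 * i] = low[i] - floor_div(prev + high[i] + 2, 4);
    }
    for (size_t i = 0; i < half; i++) {
        int right = (2 * i + 2 < n) ? x[2 * i + 2] : x[n - 2];
        x[2 * i + 1] = high[i] + floor_div(x[2 * i] + right, 2);
    }
}
```

Because each lifting step adds or subtracts an integer computed from samples that the inverse also has available, the inverse reproduces the input exactly, which is the property required for lossless coding.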
2. ENCODER COMPLEXITY ANALYSIS
We implement a JPEG2000 encoder on a DSP platform
comprising two Sundance modules, SMT395 (TI
TMS320C6416T DSP) and SMT310. We start with the
OpenJPEG (ver.1.0) [3] reference software, written in C.
Then, the TI CCS (Code Composer Studio ver.3.1) [4] is
used to compile the C code and profile the encoder
complexity.
2.1. Profiling results
TI CCS provides several simulation tools. The C64xx CPU
cycle-accurate simulator assumes a flat memory system when
simulating the C64xx processor. In contrast, the C6416
device cycle-accurate simulator provides an accurate
simulation of the C6416 processor, peripherals, and
memory system.
Table I Profiling results (in cycles) for lossless encoding
using two simulators (Goldhill 512x512)
Module   C64xx cycles    %      C6416 cycles     %      Ratio (C64xx/C6416)
DWT      73,327,701      7.8    552,674,115      6.7    13%
Tier1    846,100,912     90.9   7,509,481,733    91.7   11%
Tier2    1,550,147       <1     15,933,932       <1     9%
Others   9,475,720       1      103,399,183      1.2    9%
Total    930,454,480     100    8,181,488,963    100    11%
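The derived columns of Table I can be checked with a short C sketch. The array and function names (c64xx_cycles, c6416_cycles, sum_cycles, percent_of) are hypothetical; the cycle counts are copied from the table in the order DWT, Tier1, Tier2, Others, and the ratio column is taken to be the C64xx count divided by the C6416 count.

```c
#include <assert.h>
#include <stdint.h>

enum { NUM_MODULES = 4 };   /* DWT, Tier1, Tier2, Others */

/* Cycle counts from Table I (flat-memory C64xx CPU simulator). */
static const uint64_t c64xx_cycles[NUM_MODULES] = {
    73327701ULL, 846100912ULL, 1550147ULL, 9475720ULL
};

/* Cycle counts from Table I (C6416 device simulator). */
static const uint64_t c6416_cycles[NUM_MODULES] = {
    552674115ULL, 7509481733ULL, 15933932ULL, 103399183ULL
};

/* Sum the per-module cycle counts to obtain the "Total" row. */
static uint64_t sum_cycles(const uint64_t *cycles, int count)
{
    uint64_t total = 0;
    for (int i = 0; i < count; i++)
        total += cycles[i];
    return total;
}

/* Express 'part' as a percentage of 'whole'. */
static double percent_of(uint64_t part, uint64_t whole)
{
    assert(whole != 0);
    return 100.0 * (double)part / (double)whole;
}
```

Calling percent_of(c64xx_cycles[1], sum_cycles(c64xx_cycles, NUM_MODULES)) gives the Tier1 share on the flat-memory simulator, and percent_of applied to the two totals gives the last ratio entry. A total ratio of about 11% means the device simulator reports roughly nine times as many cycles, which suggests that the memory system, the only part modeled differently by the two simulators, contributes heavily to the measured cost.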
III - 329 1-4244-1437-7/07/$20.00 ©2007 IEEE ICIP 2007