ACCELERATION AND IMPLEMENTATION OF JPEG2000 ENCODER ON TI DSP PLATFORM

Chien-Chih Liu and Hsueh-Ming Hang

Electronics Engineering Department, National Chiao Tung University, Hsinchu, Taiwan, R.O.C.
ccliu.iic93g@nctu.edu.tw, hmhang@mail.nctu.edu.tw

This work was partially supported by National Science Council, Taiwan, R.O.C., under Grant NSC-94-2213-E-009-144.

ABSTRACT

JPEG2000 provides excellent compression performance and fine-granularity scalability, but at the cost of high computational complexity. We propose two speed-up techniques and use the TI DSP optimization tools to accelerate the Tier-1 module. First, we eliminate unnecessary checking cycles by recording the NBC (Need-to-Be-Coded) samples on a list. Second, we reorder the sample index to facilitate fast execution. In the DSP implementation of the proposed methods, we use code acceleration techniques, cache memory allocation, and the TI DSP compiler-level optimization tools. Even when the original program is compiled with the same DSP optimization tools and proper cache assignment, our fast algorithm still reduces the computation by 45%.

Index Terms: JPEG2000, DSP, algorithm acceleration

1. INTRODUCTION

Unlike the JPEG standard, which is based on the discrete cosine transform (DCT), the JPEG2000 standard [1] compresses images using the wavelet transform. It supports lossy and lossless compression of single-component (gray-level) and multi-component (color) images. The major functional blocks of the JPEG2000 encoding system are shown in Figure 1. Pre-processing includes image tiling, DC-level shifting, and the component transform. The component transform and the discrete wavelet transform each have an irreversible mode and a reversible mode, used for lossy and lossless coding, respectively. The entropy coding part of JPEG2000 adopts the EBCOT technique (Embedded Block Coding with Optimized Truncation) [2]. It consists of two major coding steps, Tier-1 and Tier-2.
The Tier-1 part is an embedded block coding scheme consisting of context formation (CF) and an arithmetic encoder (AE). The Tier-2 and rate-control part adopts PCRD (Post-Compression Rate-Distortion) optimization to truncate the embedded bitstream so that the overall distortion is minimized.

[Block diagram: Input Image, Pre-Processing (Tiling, DC-Level shifting, Component Transform), Forward Discrete Wavelet Transform, Uniform Scalar Quantization, Tier-1, Tier-2, Rate-Control, Coded Image; annotations: Real mode, Integer mode, Scalability]

Figure 1 JPEG2000 encoder architecture

2. ENCODER COMPLEXITY ANALYSIS

We implement a JPEG2000 encoder on a DSP platform that includes two Sundance modules, SMT395 (TI TMS320C6416T DSP) and SMT310. We start from the OpenJPEG (ver. 1.0) [3] reference software, which is written in C. TI CCS (Code Composer Studio ver. 3.1) [4] is then used to compile the C code and profile the encoder complexity.

2.1. Profiling results

TI CCS provides many simulation tools. The C64xx CPU cycle-accurate simulator assumes a flat memory system when simulating the C64xx processor. In contrast, the C6416 device cycle-accurate simulator models the C6416 processor together with its peripherals and memory system.

Table I Profiling results for lossless encoding using two simulators (Goldhill, 512x512)

Module    C64xx cycles    %      C6416 cycles     %      Ratio (C64xx/C6416)
DWT         73,327,701    7.8      552,674,115    6.7    13%
Tier1      846,100,912   90.9    7,509,481,733   91.7    11%
Tier2        1,550,147   <1         15,933,932   <1       9%
Others       9,475,720    1        103,399,183    1.2     9%
Total      930,454,480  100      8,181,488,963  100      11%

III - 329    1-4244-1437-7/07/$20.00 ©2007 IEEE    ICIP 2007
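The arithmetic behind Table I can be checked with a short C sketch. It is only an illustration: the cycle counts are copied from the table, while the names `ProfileRow`, `TABLE_I`, `total_cycles`, `share_percent`, `verify_table_totals`, and `print_table_i` are hypothetical helpers introduced here and are not part of OpenJPEG or TI CCS. It assumes the Ratio column is the C64xx count divided by the C6416 count, which is consistent with the listed values. 64-bit counters are used because the C6416 total exceeds the range of a 32-bit integer.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

/* One row of Table I (hypothetical type, for illustration only). */
typedef struct {
    const char *module;
    uint64_t c64xx_cycles;  /* C64xx CPU simulator, flat memory   */
    uint64_t c6416_cycles;  /* C6416 device simulator, with cache */
} ProfileRow;

/* Cycle counts copied from Table I (Goldhill 512x512, lossless). */
static const ProfileRow TABLE_I[] = {
    {"DWT",     73327701ULL,  552674115ULL},
    {"Tier1",  846100912ULL, 7509481733ULL},
    {"Tier2",    1550147ULL,   15933932ULL},
    {"Others",   9475720ULL,  103399183ULL},
};
enum { TABLE_I_ROWS = sizeof TABLE_I / sizeof TABLE_I[0] };

/* Sum one column: use_c6416 == 0 selects C64xx, otherwise C6416. */
uint64_t total_cycles(int use_c6416)
{
    uint64_t sum = 0;
    for (int i = 0; i < TABLE_I_ROWS; i++)
        sum += use_c6416 ? TABLE_I[i].c6416_cycles : TABLE_I[i].c64xx_cycles;
    return sum;
}

/* part / total expressed as a percentage. */
double share_percent(uint64_t part, uint64_t total)
{
    return 100.0 * (double)part / (double)total;
}

/* The per-module rows should add up to the "Total" row of Table I. */
void verify_table_totals(void)
{
    assert(total_cycles(0) == 930454480ULL);
    assert(total_cycles(1) == 8181488963ULL);
}

/* Print per-module shares and the C64xx/C6416 ratio for every row. */
void print_table_i(void)
{
    uint64_t t64 = total_cycles(0), t6416 = total_cycles(1);
    for (int i = 0; i < TABLE_I_ROWS; i++) {
        const ProfileRow *r = &TABLE_I[i];
        printf("%-6s %12" PRIu64 " %5.1f%% %12" PRIu64 " %5.1f%% %5.1f%%\n",
               r->module,
               r->c64xx_cycles, share_percent(r->c64xx_cycles, t64),
               r->c6416_cycles, share_percent(r->c6416_cycles, t6416),
               share_percent(r->c64xx_cycles, r->c6416_cycles));
    }
}
```

Calling `verify_table_totals()` and then `print_table_i()` from a `main` function reproduces the table's columns. The printed values may differ from Table I in the last digit, since the table's figures appear to be truncated rather than rounded.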