Low-power and high-quality Cordic-based Loeffler DCT for signal processing

C.-C. Sun, S.-J. Ruan, B. Heyne and J. Goetze

Abstract: A computationally efficient and quality-preserving discrete cosine transform (DCT) architecture is presented. It is obtained by optimising the Loeffler DCT based on the coordinate rotation digital computer (Cordic) algorithm. The computational complexity is reduced significantly from 11 multiply and 29 add operations (Loeffler DCT) to 38 add and 16 shift operations (i.e. similar to the complexity of the binDCT) without losing quality. After synthesising with the TSMC 0.13-µm technology library, Synopsys PrimePower was used to estimate the power consumption at gate level. The experimental results show that the proposed 8-point one-dimensional DCT architecture consumes only 19% of the area and about 16% of the power of the original Loeffler DCT. Moreover, it retains the good transformation quality of the original Loeffler DCT. In this regard, the proposed Cordic-based Loeffler DCT is very suitable for low-power and high-quality encoder/decoders (codecs) used in battery-based systems.

1 Introduction

Recently, many kinds of digital image processing and video compression techniques have been proposed in the literature, such as joint photographic experts group (JPEG), digital watermark, moving picture experts group (MPEG) and H.263 [1–3]. All of the aforementioned techniques require the DCT [1] to aid image/video compression. Therefore the DCT has become more and more important in today's image/video processing designs. In the past few years, much research has been done on low-power DCT designs [4–11].

In consideration of VLSI implementation, the flow-graph algorithm (FGA) is the most popular way to realise the fast DCT (FDCT) [12, 13]. In 1989, Loeffler et al. [14] proposed a low-complexity FDCT/IDCT algorithm based on FGA that requires only 11 multiply and 29 add operations.
However, the multiplications consume about 40% of the power and account for almost 45% of the total area [15]. Thus, Tran [16–18] proposed the binDCT, which approximates multiplications with add and shift operations. Later, an efficient VLSI architecture and implementation of the binDCT was presented in [19]. Although the binDCT reduces the computational complexity significantly, it loses about 2 dB in PSNR compared to the Loeffler DCT [15].

Coordinate rotation digital computer (Cordic) is an algorithm which is used to evaluate many functions and applications in signal processing [20, 21]. In addition, the Cordic algorithm is highly suited for VLSI implementation. Therefore, Jeong et al. [9] proposed a Cordic-based implementation of the DCT which requires only 104 add and 84 shift operations to realise a multiplierless transformation yielding the same transformation quality as the Loeffler DCT. Yu and Swartzlander [22] presented a scaled-DCT architecture based on the Cordic algorithm which requires two multiply and 32 add operations. However, this DCT architecture needs three additional Cordic rotations at the end of the flow graph to perform the multiplierless transformation. Therefore both the Cordic-based DCT and the scaled-Cordic-based DCT need more operations than the binDCT does [17] to carry out an exact transformation.

In this paper, we propose a computationally efficient and high-quality Cordic-based Loeffler DCT architecture, which is optimised by taking advantage of certain properties of the Cordic algorithm and its implementations [23]. On the basis of these properties, we optimise the Cordic-based Loeffler DCT by skipping some iterations whose effect is unnoticeable and by shifting the compensation steps of each angle to the quantiser. The computational complexity is reduced from 11 multiply and 29 add operations (Loeffler DCT) to 38 add and 16 shift operations (which is almost the same complexity as for the binDCT).
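
To illustrate how a Cordic rotation replaces multiplications with add and shift operations, the following is a minimal sketch of a generic textbook Cordic rotation in Python. It is not the optimised architecture proposed in this paper: the function name cordic_rotate, the iteration count and the fixed-point word length are assumptions chosen for illustration. The accumulated gain is returned separately rather than applied inside the loop, loosely mirroring the idea of deferring the compensation to a later stage.

```python
import math


def cordic_rotate(x, y, angle, iterations=16, frac_bits=16):
    """Rotate (x, y) by `angle` radians using shift-add micro-rotations.

    Hypothetical helper for illustration only. Returns the uncompensated
    result (xr, yr) together with the Cordic gain; dividing xr and yr by
    the gain yields the true rotation. Converges for |angle| up to about
    1.74 rad.
    """
    scale = 1 << frac_bits
    # Convert the inputs to fixed-point integers.
    xi = int(round(x * scale))
    yi = int(round(y * scale))
    z = angle      # residual angle still to be rotated
    gain = 1.0     # product of the per-iteration scale factors

    for i in range(iterations):
        d = 1 if z >= 0 else -1
        # Micro-rotation by atan(2^-i): only shifts and adds on the data path.
        xi, yi = xi - d * (yi >> i), yi + d * (xi >> i)
        # In hardware these angles would come from a small lookup table.
        z -= d * math.atan(2.0 ** -i)
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    return xi / scale, yi / scale, gain


if __name__ == "__main__":
    xr, yr, k = cordic_rotate(1.0, 0.0, 3 * math.pi / 8)
    # Compensation is applied here by the caller, outside the rotation loop.
    print(xr / k, yr / k, k)
```

Because the gain depends only on the number of iterations and not on the input data, it is a constant that can be folded into a later scaling step instead of being applied after every rotation.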
The experimental results also show that the presented Cordic-based Loeffler DCT architecture occupies only 19% of the area and consumes about 16% of the power of the original Loeffler DCT. Furthermore, it reduces the power dissipation to about 42% of that of the binDCT. At the same time, the PSNR simulation results show that it retains the good transformation quality of the Loeffler DCT. This paper focuses on the 8-point implementation of the DCT, but it can be generalised to any size.

This paper is organised as follows. Section 2 briefly introduces the algorithms of the DCT, Loeffler DCT and Cordic-based DCT. In Section 3, we present the proposed Cordic-based Loeffler DCT algorithm. The experimental results are shown in Section 4, and Section 5 concludes this paper.

© The Institution of Engineering and Technology 2007
doi:10.1049/iet-cds:20060289
Paper first received 22nd September 2006 and in final revised form 5th June 2007
C.-C. Sun and S.-J. Ruan are with Laboratory for Low-Power system, Department of Electronic Engineering, National Taiwan University of Science and Technology, Taipei 106, Taiwan, Republic of China
B. Heyne and J. Goetze are with University of Dortmund, Information Processing Lab, Otto-Hahn-Str. 4, Dortmund 44221, Germany
E-mail: sjruan@et.ntust.edu.tw
IET Circuits Devices Syst., 2007, 1, (6), pp. 453–461