IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-I: REGULAR PAPERS

Unified Systolic-Like Architecture for DCT and DST using Distributed Arithmetic

Pramod Kumar Meher, Senior Member, IEEE

Abstract: A common computing-core representation of the discrete cosine transform and the discrete sine transform is derived, and a reduced-complexity algorithm is developed for computation of the proposed computing-core. A parallel architecture based on the principle of distributed arithmetic is then designed for the computation of these transforms using the common-core algorithm. The proposed scheme not only leads to a systolic-like, regular, and modular hardware for computing these transforms, but also offers significant improvement in area-time efficiency over the existing structures. The proposed structure is devoid of complicated input/output mapping and does not involve any complex control. Unlike the convolution-based structures, it does not restrict the transform-length to be a prime or a multiple of a prime, and it can be utilized as a reusable core for cost-effective, memory-efficient, high-throughput implementation of either of these transforms.

Index Terms: Discrete cosine transform (DCT), discrete sine transform (DST), distributed arithmetic, systolic array, very large-scale integration (VLSI), digital signal processing chip

I. INTRODUCTION

The discrete cosine transform (DCT) and the discrete sine transform (DST) play key roles in several signal and image processing applications, especially because of their near-optimal transform coding performance [1]–[3]. Since both these transforms are computation-intensive and frequently encountered, several algorithms have been suggested for computing them efficiently on general-purpose computers [4], [5]. General-purpose machines, however, very often do not meet the speed requirements of various real-time applications or the size constraints of many portable systems.
Considerable importance is, therefore, attached to the design of dedicated hardware architectures for fast and efficient calculation of these transform components. It is further observed that the algorithms designed for software implementation are not well suited for dedicated hardware. Parallel algorithms and architectures are, therefore, imperative for efficient realization of these transforms in VLSI structures.

Appropriate algorithm design plays a major role in developing a hardware entity that can satisfy the system requirements and specifications. It should not only reduce the computational complexity, but also facilitate maximization of concurrency to achieve high-throughput performance. Moreover, the architecture should be developed in synergy with the underlying algorithm to derive an area-time-efficient VLSI system.

Systolic designs represent a popular class of architectural solutions for efficient VLSI implementation of high-performance digital signal processing (DSP) applications [6]. Several systolic array architectures have, therefore, been suggested for efficient computation of the DCT and the DST [7]–[12]. The multipliers in these structures, however, occupy a large portion of the chip-area, and consequently impose a stringent limit on the maximum number of processing elements (PEs) that can be used and on the maximum transform-size that can be implemented.

Manuscript submitted on July 23, 2005. Revised November 25, 2005 and June 20, 2006. This paper was recommended by Associate Editor Zhongfeng Wang.
The author is with the School of Computer Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore, 639798. Email: aspkmeher@ntu.edu.sg, URL: http://www.ntu.edu.sg/home/aspkmeher/.
Copyright (c) 2006 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.
Memory-based techniques have gained substantial popularity in recent years due to their low hardware-complexity, high-throughput processing, and increased regularity, which result in cost-effective and efficient VLSI structures [13]–[23]. There are two basic techniques for memory-based hardware realization. One is the direct-ROM-based implementation of multiplications [15]–[17], while the other is based on distributed arithmetic (DA) [18]–[22]. The DA-based implementation is applicable to the calculation of inner-products when one of the vectors/sequences is fixed. It yields faster output than the multiplier-accumulator-based designs because it stores the pre-computed partial results of the inner-products in the memory elements. Due to its efficiency in VLSI implementation, the DA principle is widely used in various DSP applications, and it has also been utilized to realize many commercial products [23].

In the direct-ROM-based implementations, the multipliers used for multiplication of input values with the fixed transform kernel coefficients are replaced by ROM-based look-up-tables (LUTs) of size $2^L$, where $L$ is the word-length and each ROM table contains the pre-computed product values for all possible values of the input samples. This technique is used to implement the DCT and the DST in linear systolic arrays, after converting the transform into a circular convolution or convolution-like form [15]–[17]. It involves less hardware-complexity than the DA-based method when the word-length is less than the transform-length; on the other hand, its latency and its average computation time (ACT) increase in proportion to the transform-size. Besides, it requires the transform-length to be a prime number so that the DCT or the DST can be converted into a circular-convolution structure. Each of these two memory-based techniques has advantages and disadvantages relative to the other.
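As an illustration of the DA principle described above, the following Python sketch computes an inner-product with a fixed integer coefficient vector using only table look-ups, shifts, and additions. It is a minimal software model under stated assumptions, not the architecture proposed in this paper: the names `build_da_lut` and `da_inner_product`, the two's-complement input encoding, and the example values are all hypothetical choices made for illustration.

```python
def build_da_lut(coeffs):
    """Pre-compute the DA look-up-table for a fixed coefficient vector.

    Entry `addr` holds the sum of coeffs[k] over every k whose bit is
    set in `addr`, so the table has 2**len(coeffs) entries.
    """
    n = len(coeffs)
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << n)
    ]


def da_inner_product(lut, xs, word_length):
    """Inner-product of the fixed coefficients with `xs`, bit-serially.

    Assumption: every x in `xs` is a signed integer that fits in
    `word_length` bits of two's complement.
    """
    mask = (1 << word_length) - 1
    words = [x & mask for x in xs]  # two's-complement bit patterns
    acc = 0
    for b in range(word_length):
        # Bit b of every input sample forms one LUT address.
        addr = 0
        for k, w in enumerate(words):
            addr |= ((w >> b) & 1) << k
        partial = lut[addr] << b  # weight the partial sum by 2**b
        if b == word_length - 1:
            acc -= partial  # the sign bit carries negative weight
        else:
            acc += partial
    return acc


# Hypothetical example values (not taken from the paper).
coeffs = [3, -5, 7, 2]
xs = [4, -3, 1, -8]
lut = build_da_lut(coeffs)
result = da_inner_product(lut, xs, word_length=4)
direct = sum(c * x for c, x in zip(coeffs, xs))
```

In this model no multiplication is performed at run time: each of the `word_length` iterations costs one look-up and one shift-add, which reflects why storing pre-computed partial results speeds up the inner-product. The cost is a table whose size grows exponentially with the number of coefficients addressed by one look-up.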
Some of the advantages of the DA-based method over the direct-ROM-based method are:
1) It does not restrict the transform-length to be a prime number or a multiple of a prime.