IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 50, NO. 9, SEPTEMBER 2002 2347

A Systolic Array Architecture for the Discrete Sine Transform

Doru Florin Chiper, M. N. S. Swamy, Fellow, IEEE, M. Omair Ahmad, Fellow, IEEE, and Thanos Stouraitis, Senior Member, IEEE

Abstract: An efficient approach to the design of very large scale integration (VLSI) architectures and a scheme for the implementation of the discrete sine transform (DST), based on an appropriate decomposition method that uses circular correlations, are presented. The proposed design restructures the computation of the DST into two circular correlations that have similar structures and only one half of the length of the original transform; these can be computed concurrently and mapped onto the same systolic array. A significant improvement in the computational speed can be obtained at a reduced input–output (I/O) cost and low hardware complexity, while retaining all the other benefits of the VLSI implementations of discrete transforms that use circular correlation or cyclic convolution structures. These features are demonstrated by comparing the proposed design with some of the recently reported schemes.

Index Terms: Discrete sine transform, systolic arrays, VLSI algorithms.

I. INTRODUCTION

The discrete sine transform (DST) and the discrete cosine transform (DCT) are the key functions used in many signal and image processing applications, especially in transform coding. For images with high correlation, the DCT yields better results; however, for images with a low correlation of coefficients, the DST yields lower bit rates [2]. The DST is signal independent and represents a good approximation of the statistically optimal Karhunen-Loève transform [1]. The DST constitutes the basis of the recursive block coding technique [2] and is used in a fast implementation of lapped orthogonal transforms [3].
Manuscript received August 14, 2000; revised May 14, 2002. This work was supported in part by the Micronet National Network of Centers of Excellence, the Natural Sciences and Engineering Research Council (NSERC) of Canada, and Fonds pour la Formation des Chercheurs et l’Aide à la Recherche of Quebec. The associate editor coordinating the review of this paper and approving it for publication was Prof. Chaitali Chakrabarti.
D. F. Chiper is with the Department of Applied Electronics, Technical University “Gh. Asachi,” Iasi, Romania.
M. N. S. Swamy and M. O. Ahmad are with the Center for Signal Processing and Communications, Department of Electrical and Computer Engineering, Concordia University, Montreal, QC, Canada H3G 1M8 (e-mail: swamy@ece.concordia.ca).
T. Stouraitis is with the Department of Electrical and Computer Engineering, University of Patras, Patras, Greece.
Publisher Item Identifier 10.1109/TSP.2002.801940.

Since the DST is computationally intensive, the derivation of new efficient algorithms for its parallel very large scale integration (VLSI) implementation is highly desirable. Data movement and transfer play an important role in determining the efficiency of a VLSI implementation of hardware algorithms [4]. This explains why the use of cyclic convolution and circular correlation structures provides high computing speed, low computational complexity, and low I/O bandwidth, as has already been shown for the discrete Fourier transform (DFT) [5] and for the DCT [6]. Because these structures have a simple and regular data flow and are easy to implement through modular and regular hardware techniques, such as distributed arithmetic [7] and systolic arrays [8], the conversion of the DST into a cyclic convolution or a circular correlation structure leads to an efficient solution for its VLSI implementation.
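To make the regular data flow of a circular correlation concrete, a minimal sketch is given here. It is not the architecture proposed in this paper: the function name `circular_correlation` and the index convention y(k) = sum over i of x(i) h((i + k) mod M) are assumptions chosen for illustration only.

```python
def circular_correlation(x, h):
    """Direct length-M circular correlation (illustrative convention).

    Computes y[k] = sum_i x[i] * h[(i + k) % M] for k = 0..M-1.
    The index convention is an assumption; other texts shift the
    other operand or use the opposite sign of k.
    """
    m = len(x)
    if len(h) != m:
        raise ValueError("x and h must have the same length")
    y = []
    for k in range(m):
        acc = 0
        for i in range(m):
            # Every output reuses the same kernel values, rotated by k.
            acc += x[i] * h[(i + k) % m]
        y.append(acc)
    return y


if __name__ == "__main__":
    print(circular_correlation([1, 2, 3], [1, 0, 0]))  # prints [1, 3, 2]
```

Each output is formed from the same kernel values rotated by one position, so the same operands can be passed from one multiply-accumulate cell to the next. This regularity is what makes such structures a natural fit for systolic arrays.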
In this paper, we propose a new input sequence and appropriate index mappings to arrive at an efficient conversion of a prime-length DST into two parallel circular correlation structures of one half of the original length. A substantial improvement in the processing speed of the VLSI realization is thus obtained. This realization preserves all the advantages reported in [6] for the DCT. The two circular correlations have the same structure and length; only the control tags and the input and output sequences are different. Their data-dependence graphs can be mapped onto systolic arrays, as shown in [9]. The systolic array implementations can be efficiently unified using the method proposed in [10]. There are some differences in sign, which are efficiently managed using the tag control scheme [11]. We can obtain a significant speed improvement with a slight increase in the hardware complexity compared with that of the schemes in [6], [12], and [13], while preserving all the advantages in architectural topology, input–output (I/O) cost, and computational complexity of the VLSI implementations of discrete transforms that use systolic arrays based on circular correlation structures.

1053-587X/02$17.00 © 2002 IEEE

II. NEW ALGORITHM FOR THE DST

The DST of the input sequence is defined as [1]

(1)

where . If the transform length is a prime number greater than 2, we can introduce a new input sequence, which is defined as

(2)