IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 50, NO. 9, SEPTEMBER 2002 2347
A Systolic Array Architecture for the
Discrete Sine Transform
Doru Florin Chiper, M. N. S. Swamy, Fellow, IEEE, M. Omair Ahmad, Fellow, IEEE, and
Thanos Stouraitis, Senior Member, IEEE
Abstract—An efficient approach to design very large scale
integration (VLSI) architectures and a scheme for the
implementation of the discrete sine transform (DST), based on an
appropriate decomposition method that uses circular correlations,
is presented. The proposed design uses an efficient restructuring
of the computation of the DST into two circular correlations,
having similar structures and only one half of the length of
the original transform; these can be concurrently computed
and mapped onto the same systolic array. Significant
improvement in the computational speed can be obtained at a reduced
input–output (I/O) cost and low hardware complexity, retaining
all the other benefits of the VLSI implementations of the discrete
transforms, which use circular correlation or cyclic convolution
structures. These features are demonstrated by comparing the
proposed design with some of the recently reported schemes.
Index Terms—Discrete sine transform, systolic arrays, VLSI
algorithms.
I. INTRODUCTION
The discrete sine transform (DST) and the discrete cosine
transform (DCT) are key functions used in many signal and image
processing applications, especially in transform coding. For
images with high correlation, the DCT yields better results;
however, for images with a low correlation of coefficients, the
DST yields lower bit rates [2]. The DST is signal independent and
represents a good approximation of the statistically optimal
Karhunen-Loeve transform [1]. The DST constitutes the basis of
the recursive block coding technique [2] and is used in a fast
implementation of lapped orthogonal transforms [3].
Manuscript received August 14, 2000; revised May 14, 2002. This work was
supported in part by the Micronet National Network of Centers of Excellence,
the Natural Sciences and Engineering Research Council (NSERC) of Canada,
and Fonds pour la Formation des Chercheurs et l’Aide a la Recherche of Quebec.
The associate editor coordinating the review of this paper and approving it for
publication was Prof. Chaitali Chakrabarti.
D. F. Chiper is with the Department of Applied Electronics, Technical
University “Gh. Asachi,” Iasi, Romania.
M. N. S. Swamy and M. O. Ahmad are with the Center for Signal
Processing and Communications, Department of Electrical and Computer
Engineering, Concordia University, Montreal, QC, Canada H3G 1M8 (e-mail:
swamy@ece.concordia.ca).
T. Stouraitis is with the Department of Electrical and Computer Engineering,
University of Patras, Patras, Greece.
Publisher Item Identifier 10.1109/TSP.2002.801940.
Since the DST is computationally intensive, the derivation
of new efficient algorithms for its parallel very large scale
integration (VLSI) implementation is highly desirable. The data
movement and transfer play an important role in determining
the efficiency of a VLSI implementation of the hardware
algorithms [4]. This explains why the use of cyclic convolution and
circular correlation structures provides high computing speed,
low computational complexity, and low I/O bandwidth, as has
already been shown for the discrete Fourier transform (DFT)
[5] and for the DCT [6]. Because these structures have a simple
and regular data flow and are easy to implement with modular and
regular hardware techniques, such as distributed arithmetic [7]
and systolic arrays [8], the conversion of the DST into a cyclic
convolution or a circular correlation structure leads to an
efficient solution for its VLSI implementation.
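To make the two structures named above concrete, the following minimal Python sketch computes a circular correlation and a cyclic convolution of two equal-length sequences. It is only an illustration: the function names and the indexing convention (which argument is rotated, and in which direction) are assumptions chosen here and are not taken from the paper. Each output reuses the same kernel values in rotated order, which is the regular data flow that suits a systolic array.

```python
def circular_correlation(x, h):
    """Return y[k] = sum_i x[i] * h[(i + k) mod n] for k = 0..n-1.

    Assumed convention for illustration: the kernel h is rotated forward.
    """
    n = len(x)
    if len(h) != n:
        raise ValueError("both sequences must have the same length")
    return [sum(x[i] * h[(i + k) % n] for i in range(n)) for k in range(n)]


def cyclic_convolution(x, h):
    """Return y[k] = sum_i x[i] * h[(k - i) mod n] for k = 0..n-1."""
    n = len(x)
    if len(h) != n:
        raise ValueError("both sequences must have the same length")
    return [sum(x[i] * h[(k - i) % n] for i in range(n)) for k in range(n)]


# Small integer example so the results can be checked by hand.
x = [1, 2, 3]
h = [4, 5, 6]
corr = circular_correlation(x, h)
conv = cyclic_convolution(x, h)
print(corr, conv)
```

In both functions every output is a sum over the same n kernel values, visited in a rotated order, so in hardware the kernel can simply circulate past fixed processing elements.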
In this paper, we propose a new input sequence and
appropriate index mappings to arrive at an efficient conversion of a
prime-length DST into two parallel circular correlation
structures of one half of the original length. Substantial improvement
in the processing speed of the VLSI realization is thus obtained.
This realization preserves all the advantages reported in [6] for
the DCT. The two circular correlations have the same
structure and length; only the control tags and the input and
output sequences are different. Their data-dependence graphs
can be mapped onto systolic arrays, as shown in [9]. The
systolic array implementations can be efficiently unified using the
method proposed in [10]. There are some differences in sign,
which are efficiently managed using the tag control scheme [11].
We can obtain a significant speed improvement with a slight
increase in the hardware complexity compared with that of the
schemes in [6], [12], and [13], while preserving all the advantages of
architectural topology, input–output (I/O) cost, and
computational complexity of the VLSI implementations of the discrete
transforms that use systolic arrays based on circular correlation
structures.
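As a rough illustration of how an index mapping can turn a prime-length sine sum into a circular correlation, the Python sketch below applies a primitive-root (Rader-style) permutation to a toy kernel sin(2*pi*i*k/p). This is not the decomposition derived in this paper: the kernel, the function names, and the use of a single full-length correlation of length p - 1 are all assumptions made for illustration only.

```python
import math


def primitive_root(p):
    """Smallest primitive root of an odd prime p, found by brute force."""
    for g in range(2, p):
        if len({pow(g, e, p) for e in range(p - 1)}) == p - 1:
            return g
    raise ValueError("no primitive root found; p must be an odd prime")


def sine_sums_direct(x, p):
    """Toy kernel: Y(k) = sum_{i=1}^{p-1} x[i] * sin(2*pi*i*k/p), k = 1..p-1."""
    return {
        k: sum(x[i] * math.sin(2 * math.pi * i * k / p) for i in range(1, p))
        for k in range(1, p)
    }


def sine_sums_by_correlation(x, p):
    """Same sums, computed as one circular correlation of length p - 1."""
    g = primitive_root(p)
    n = p - 1
    perm = [pow(g, a, p) for a in range(n)]  # index mapping a -> g^a mod p
    xp = [x[i] for i in perm]                # permuted input sequence
    s = [math.sin(2 * math.pi * perm[m] / p) for m in range(n)]  # permuted kernel
    # i*k = g^(a+b) mod p, so the product index becomes a sum of exponents.
    y = [sum(xp[a] * s[(a + b) % n] for a in range(n)) for b in range(n)]
    return {perm[b]: y[b] for b in range(n)}


p = 7
x = [0.0, 3.0, -1.0, 4.0, 1.0, -5.0, 9.0]  # x[0] is unused by the toy kernel
direct = sine_sums_direct(x, p)
mapped = sine_sums_by_correlation(x, p)
max_error = max(abs(direct[k] - mapped[k]) for k in range(1, p))
```

In this toy setting the permuted kernel changes sign after half a period, because g^((p-1)/2) is congruent to -1 modulo p. That gives some intuition for why half-length structures with sign handling can arise, although the paper's own input sequence and mappings are not reproduced here.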
II. NEW ALGORITHM FOR THE DST
The DST of the input sequence is
defined as [1]
(1)
where . If the transform length is a prime number
greater than 2, we can introduce a new input sequence, which is
defined as
(2)
1053-587X/02$17.00 © 2002 IEEE