SUBMITTED TO IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY

Parallel and Pipelined Architectures for Cyclic Convolution by Block Circulant Formulation using Low-Complexity Short-Length Algorithms

Pramod Kumar Meher, Senior Member, IEEE

Abstract— Fully-pipelined parallel architectures are derived for high-throughput and reduced-hardware realization of prime-factor cyclic convolution using hardware-efficient modules for short-length rectangular transform (RT). Moreover, a new approach is proposed for the computation of block pseudo-cyclic convolution using a block cyclic convolution of equal length along with some correction terms, so that the block pseudo-cyclic representation of cyclic convolution for non-prime-factor lengths (N = rP, where r and P are not mutually prime) can be computed efficiently using the algorithms and architectures of short-length cyclic convolutions. Low-complexity algorithms are derived for efficient computation of those correction terms, and the overall complexities of the proposed technique are estimated for r = 2, 3, 4, 6, 8 and 9. The proposed algorithms are used further to design high-throughput and reduced-hardware structures for cyclic convolution where the co-factors are not relatively prime. The proposed structures for high-throughput implementation are found to offer a reduction of nearly 50% to 75% in area-delay product over the existing structures for several convolution lengths. Low-complexity structures for the input/output addition units of short convolution lengths are derived and used along with high-throughput modules for hardware-efficient realization of multi-factor convolution, which offers nearly 25% to 75% reduction in area-delay complexity over the existing structures for various non-prime-factor convolution lengths.

Index Terms— Cyclic convolution, block cyclic convolution, pseudo-cyclic convolution, systolic array, VLSI.

The author is with the School of Computer Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore, 639798. E-mail: aspkmeher@ntu.edu.sg. URL: http://www.ntu.edu.sg/home/aspkmeher/

Copyright (c) 2008 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.

I. INTRODUCTION

Cyclic convolution is used as a basic tool in digital signal and image processing applications [1]. Various sinusoidal transforms, such as the discrete cosine transform (DCT), discrete sine transform (DST), discrete Fourier transform (DFT), and discrete Hartley transform (DHT), can be converted into cyclic convolutional form and computed efficiently by fast cyclic convolution algorithms [2]–[8]. Cyclic convolution can also be used for efficient computation of block linear convolution for parallel implementation of finite impulse response (FIR) filters. Since sinusoidal transforms and FIR filtering are not only computation-intensive but are also encountered frequently as integral parts of many video-processing algorithms, it is of high importance to design dedicated hardware for area- and time-efficient implementation of cyclic convolution. Several attempts have been made in recent years to implement the DFT, DCT, DST and DHT through cyclic convolutional formulation in systolic and systolic-like hardware [2]–[8], owing to the remarkable advantages of convolution-based designs. The systolic designs are attractive for VLSI and FPGA implementation, but they involve O(N) multipliers and adders to compute an N-point cyclic convolution. To achieve lower area-delay complexity than the systolic designs, Cheng and Parhi [8] have recently suggested a VLSI architecture for computation of cyclic convolution of length N = rP, and have used it for efficient computation of the DCT.
Interestingly, the structure of [8] imposes fewer theoretical restrictions on the convolution length, because r and P need not be relatively prime. To obtain this flexibility of convolution length, the authors of [8] convert the convolution into a pseudo-cyclic convolution, which is computed iteratively by their 2-point and 3-point algorithms for linear convolution and the Agarwal-Cooley algorithms for short-length cyclic convolution [9]. The structures of [8] are efficient in terms of the number of multipliers and adders compared with the systolic designs, but they involve high control and communication complexities, particularly for large convolution lengths. Moreover, they are suitable only for applications where the throughput requirement is not very high, because the computation of all the P-point sub-convolutions is multiplexed onto a single P-point convolution unit and the resulting outputs are added sequentially. It is found that the structures of [8] for length N = rP require r circular shifters to rotate the P-point blocks circularly before they are added or subtracted to compute the desired output. This not only results in long wiring for data communication, but also increases the control complexity of implementing convolutions of long and multi-factor lengths. In this paper, we therefore present an alternative formulation for computation of composite-length convolution of length N = rP (where r and P are not relatively prime), in which the block pseudo-cyclic convolution is converted into a block cyclic convolution with some correction terms.

The rest of the paper is organized as follows: The mathematical formulation to convert prime-factor convolution into block cyclic convolution, and the efficient computation of block cyclic convolution using the short-length rectangular transform (RT) algorithm [9], are discussed in Section-II.
Simple and modular pipelined architectures for short-length cyclic convolution and block cyclic convolution are derived in Section-III. The conversion of block pseudo-cyclic convolution to block cyclic convolution and the architectures of non-prime-factor convolution are described in Section-IV. Hardware and time complexities are discussed in Section-V and conclusions are presented in Section-VI.
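
The idea of computing a cyclic convolution of length N = rP from P-point sub-convolutions, arranged first as a block cyclic convolution and then adjusted by correction terms, can be checked numerically with a short sketch. The code below is only an illustration under stated assumptions and is not the formulation derived in the later sections: it assumes a polyphase split of the sequences into r blocks of length P, and the helper names (`cyclic_conv`, `rotate`, `block_cyclic_with_correction`) are hypothetical. Each correction term here is simply the difference between a rotated and an unrotated P-point sub-convolution; the low-complexity correction algorithms of the paper are not reproduced.

```python
def cyclic_conv(h, x):
    """Direct cyclic convolution: y[n] = sum_l h[l] * x[(n - l) mod N]."""
    n_len = len(x)
    return [sum(h[l] * x[(n - l) % n_len] for l in range(n_len))
            for n in range(n_len)]


def rotate(v, s=1):
    """Circularly delay v by s positions: out[m] = v[(m - s) mod P]."""
    p = len(v)
    return [v[(m - s) % p] for m in range(p)]


def block_cyclic_with_correction(h, x, r):
    """N-point cyclic convolution (N = r*P) from P-point sub-convolutions.

    Hypothetical helper (an assumption, not the paper's algorithm): forms an
    r x r block cyclic convolution of the polyphase blocks and adds
    correction terms to recover the block pseudo-cyclic result.
    r and P need not be relatively prime.
    """
    n_len = len(x)
    assert len(h) == n_len and n_len % r == 0
    p = n_len // r
    # Polyphase blocks: block i holds samples i, i + r, i + 2r, ...
    hb = [h[j::r] for j in range(r)]
    xb = [x[i::r] for i in range(r)]
    y = [0] * n_len
    for k in range(r):
        acc = [0] * p
        for j in range(r):
            # Block cyclic term: block index (k - j) wraps modulo r.
            sub = cyclic_conv(hb[j], xb[(k - j) % r])
            acc = [a + s for a, s in zip(acc, sub)]
            if j > k:
                # The block index wrapped around, so the pseudo-cyclic form
                # needs the rotated product; add (rotated - unrotated).
                rot = rotate(sub)
                acc = [a + (u - s) for a, u, s in zip(acc, rot, sub)]
        y[k::r] = acc  # interleave output block k back into y
    return y


if __name__ == "__main__":
    h = [1, 2, 3, 4]
    x = [5, 6, 7, 8]
    # N = 4 with r = 2, P = 2: the co-factors are not relatively prime.
    print(block_cyclic_with_correction(h, x, 2) == cyclic_conv(h, x))
```

In this sketch every P-point product is computed by the direct `cyclic_conv`; in a hardware-oriented design each of these calls would presumably be replaced by a short-length cyclic convolution module, which is the role the short-length algorithms play in the text above.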