Multiplier-less VLSI architecture for real-time computation of multi-dimensional convolution

Ming Z. Zhang, Hau T. Ngo, Vijayan K. Asari *

Computational Intelligence and Machine Vision Laboratory, Department of Electrical and Computer Engineering, Old Dominion University, Norfolk, VA 23529, USA

Available online 28 August 2006

Microprocessors and Microsystems 31 (2007) 25–37. www.elsevier.com/locate/micpro
0141-9331/$ - see front matter. doi:10.1016/j.micpro.2006.07.004

* Corresponding author. Tel.: +1 757 683 3752. E-mail addresses: mzhan002@odu.edu (M.Z. Zhang), hngox001@odu.edu (H.T. Ngo), vasari@odu.edu (V.K. Asari).

Abstract

A VLSI-efficient multiplier-less architecture for real-time computation of multi-dimensional convolution is presented in this paper. The new architecture performs computations in the logarithmic domain by utilizing novel multiplier-less log2 and inverse-log2 modules that can convert fractional numbers, a capability not currently available in the literature. An effective data handling strategy is developed in conjunction with the logarithmic modules to eliminate the need for multipliers in the architecture. The proposed approach reduces hardware resources significantly compared to other approaches while maintaining a high degree of accuracy. The architecture is developed as a combined systolic-pipelined design that produces an output in every clock cycle after an initial latency of 93.19 µs. The architecture is capable of operating with a clock frequency of 99 MHz on Xilinx's Virtex II 2v2000ff896-4 FPGA, and the throughput of the system is observed to be 99 MOPS (million outputs per second). Error analysis performed with the FPGA-based system in the image processing examples of edge detection and noise filtering shows that the proposed architecture produces outputs similar to those obtained by software simulation using Matlab.

© 2006 Elsevier B.V. All rights reserved.

Keywords: Multi-dimensional convolution; Multiplier-less architecture; Logarithmic domain computation; Systolic-pipelined architecture; FPGA-based implementation

1. Introduction

Convolution is one of the many computationally intensive yet fundamental operations in digital signal processing applications, which include speech processing, digital communications, and digital image and video processing. General purpose processors can be used to perform the convolution operation; however, these processors do not fully exploit the parallelism inherent in this operation. In addition, the kernel size is usually limited to a small bounded range to sustain real-time throughput. Dedicated hardware units are well suited to high speed processing and large kernel sizes, but these units usually compromise the flexibility of the architecture by adapting the design to specific kernel coefficients. Hence the architecture is only applicable to a specific transfer function. It is desirable to find optimal designs that reduce hardware resources and power consumption while supporting a wide range of kernel coefficients for different characteristics of the transfer functions.

The definition of N-dimensional convolution O = W * I in general can be expressed as

$$
O(m_1, m_2, \ldots, m_N) = \sum_{j_1=-a_1}^{a_1} \sum_{j_2=-a_2}^{a_2} \cdots \sum_{j_N=-a_N}^{a_N} W(j_1, j_2, \ldots, j_N) \times I(m_1 - j_1, m_2 - j_2, \ldots, m_N - j_N), \qquad (1)
$$

where $a_i = \frac{J_i - 1}{2}$, $0 \le m_i \le M_i - 1$, $1 \le i \le N$, and W is the kernel function. The computational complexity is $O\left(\prod_{i=1}^{N} M_i \times \prod_{i=1}^{N} J_i\right)$. For N = 2, the complexity is on the order of $O(M_1 \times M_2 \times J_1 \times J_2)$, where in image processing