FPGA-based architecture for the real-time computation of 2-D convolution with large kernel size

F. Javier Toledo-Moreo*, J. Javier Martínez-Alvarez, Javier Garrigós-Guerrero, J. Manuel Ferrández-Vicente
Dpto. Electrónica y Tecnología de Computadoras, Universidad Politécnica de Cartagena, Spain

Article history: Received 15 November 2011; Received in revised form 28 April 2012; Accepted 14 June 2012; Available online 26 June 2012

Keywords: 2-D convolution; Large kernel size; FPGA; Embedded and real-time systems

Abstract

Bidimensional convolution is a low-level processing algorithm of interest in many areas, but its high computational cost constrains the size of the kernels, especially in real-time embedded systems. This paper presents a hardware architecture for the FPGA-based implementation of 2-D convolution with medium-to-large kernels. It is a multiplierless solution based on Distributed Arithmetic, implemented using general-purpose FPGA resources. Our proposal is modular and coefficient-independent, so it remains fully flexible and customizable for any application. The architecture design includes a control unit to efficiently manage the operations at the borders of the input array. Results in terms of occupied resources and timing are reported for different configurations, and compared with other state-of-the-art approaches to validate our proposal.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

Bidimensional convolution is a basic tool in many areas. Convolution with a kernel or template has long been used in image and video processing for spatial-domain filtering in low-level processing stages, with the aim of manipulating data, extracting information of interest, enhancing or nulling specific characteristics, for pattern recognition or other higher-level goals [1,2].
Nowadays more than ever, the ubiquity of cameras in devices like smartphones and tablets, surveillance systems, automobiles, etc. has given rise to a plethora of embedded video processing systems for nearly every conceivable application, and 2-D convolution is a fundamental operation in most of them. Moreover, 2-D convolution also lies at the basis of machine learning algorithms [3] and biologically-inspired models [4-7].

For some applications, like simple edge detection, image smoothing or sharpening, convolution kernels from a small 3x3 up to 7x7 are typically sufficient, as evidenced by the 5x5 customizable mask for image filtering available in the popular user-level Adobe Photoshop® software. For other applications, like object tracking, estimation filtering, physiological modeling or pattern recognition, larger kernels can be more interesting (e.g. [8-12]).

Although conceptually simple, the computation of 2-D convolution, given by the sum of products (1), is not trivial: with an MxN kernel, it requires MxN multiplications and MxN-1 additions, besides MxN accesses to the input data, for the calculation of a single output.

O(x, y) = \sum_{i=-M/2+1}^{M/2} \sum_{j=-N/2+1}^{N/2} h(i, j) \, I(x + j, y + i)   (1)

This implies that more than 1.35 giga-operations per second (GOPs) are required to sustain a real-time processing rate of 1280x720 HD video at 30 frames per second (fps) with a 5x5 kernel, and more than 12.41 GOPs with a 15x15 one. Clearly, the computational load and the complexity of memory access grow quadratically with the kernel dimensions. Historically, the kernel size has been constrained to a small bounded range because of the inefficiency of CPU-based software approaches in carrying out such a huge number of operations.
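The sum of products in Eq. (1) and the arithmetic-cost figures quoted above can be sketched in plain Python. This is a minimal illustrative sketch, not the paper's hardware architecture: the function names are our own, and it uses a simple "valid" indexing (output computed only where the kernel fully overlaps the input) rather than the paper's border handling.

```python
def conv2d(I, h):
    """Direct 2-D sum of products per Eq. (1), 'valid' style.

    For each output sample this performs M*N multiplications and
    M*N-1 additions, with M*N accesses to the input array.
    """
    M, N = len(h), len(h[0])        # kernel: M rows x N columns
    H, W = len(I), len(I[0])        # input image dimensions
    out = [[0] * (W - N + 1) for _ in range(H - M + 1)]
    for y in range(H - M + 1):
        for x in range(W - N + 1):
            acc = 0
            for i in range(M):
                for j in range(N):
                    acc += h[i][j] * I[y + i][x + j]
            out[y][x] = acc
    return out

def gops(width, height, fps, M, N):
    """Giga-operations per second for real-time video with an MxN kernel."""
    ops_per_output = M * N + (M * N - 1)   # multiplications + additions
    return width * height * fps * ops_per_output / 1e9
```

With this accounting, `gops(1280, 720, 30, 5, 5)` evaluates to about 1.35 and `gops(1280, 720, 30, 15, 15)` to about 12.41, matching the figures given in the text.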
Transformation into the frequency domain, computing the convolution as the inverse FFT of the product of the FFTs of the input and the kernel, is a significant improvement for software implementation, but only through parallel processing is it possible to achieve the throughput required by real-time applications. FPGA devices, and more recently GPUs, offer this chance.

In the last few years, Field Programmable Gate Array (FPGA) devices have been the predominant hardware platform for computing 2-D convolution, due to their fine-grain parallelism and reconfigurability. The FPGA internal structure makes it perfectly suited to exploiting the pixel-level parallelism inherent to low-level image processing algorithms, like the local neighborhood function defined by Eq. (1), instruction-level parallelism by means of pipelining, as well as, at a higher level, task parallelism (e.g. multiple

1383-7621/$ - see front matter © 2012 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.sysarc.2012.06.002
* Corresponding author. E-mail address: javier.toledo@upct.es (F. Javier Toledo-Moreo).
Journal of Systems Architecture 58 (2012) 277-285
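The frequency-domain approach mentioned above can be sketched with NumPy. This is an illustrative sketch under our own naming, not an implementation from the paper: the product of 2-D FFTs yields a circular convolution, and its cost depends on the image size but not on the kernel size, which is precisely why it helps software implementations with large kernels.

```python
import numpy as np

def fft_convolve2d(image, kernel):
    """Circular 2-D convolution as IFFT( FFT(image) * FFT(kernel) ).

    The kernel is zero-padded to the image size; the work is
    O(W*H*log(W*H)) regardless of the kernel dimensions.
    """
    H, W = image.shape
    padded = np.zeros((H, W))
    kh, kw = kernel.shape
    padded[:kh, :kw] = kernel
    spectrum = np.fft.fft2(image) * np.fft.fft2(padded)
    return np.real(np.fft.ifft2(spectrum))
```

In contrast to the direct sum of products, whose per-pixel cost grows with M*N, this routine does the same amount of work for a 3x3 kernel as for a 63x63 one; border behavior differs (circular wrap-around), which a software implementation would handle with extra padding.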