This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS

Architecture of a Fully Pipelined Real-Time Cellular Neural Network Emulator

Nerhun Yildiz, Member, IEEE, Evren Cesur, Member, IEEE, Kamer Kayaer, Vedat Tavsanoglu, Senior Member, IEEE, and Murathan Alpay

Abstract—In this paper, the architecture of a Real-Time Cellular Neural Network (CNN) Processor (RTCNNP-v2) is presented and the implementation results are discussed. The proposed architecture has a fully pipelined structure capable of processing full-HD 1080p@60 video streams (1920 × 1080 resolution at a 60 Hz frame rate, 124.4 MHz visible pixel rate). It is implemented on both a high-end and a low-cost FPGA device: the Altera Stratix IV GX 230 and the Cyclone III C 25, respectively. Many features of the architecture are designed to be either pre-synthesis configurable or runtime programmable, which makes the processor extremely flexible, reusable, scalable, and practical.

Index Terms—Cellular neural networks, field programmable gate arrays, real time systems, reconfigurable architectures.

I. INTRODUCTION

Cellular neural networks (CNN) are a parallel computing paradigm [1] with many applications, such as image processing, artificial vision, and solving partial differential equations. A multi-dimensional, multi-layer CNN structure consists of a spatial grid of neural cells of the same dimension, and each cell contains one memory node per layer. The spatio-temporal dynamics of the system are tuned for specific tasks by defining local spatial interconnections between the neural cells. Generally, a 2-D 1-layer CNN structure with space-invariant neural weights [2] is used in image processing applications, which is the focus of this work. Extending the architecture proposed in this paper to support two- or multi-layer CNN structures is ongoing work and beyond the scope of this paper.
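The visible pixel rate quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name is hypothetical, and it counts visible pixels only, ignoring the blanking intervals that a real video interface adds.

```python
def visible_pixel_rate_hz(width, height, frame_rate_hz):
    """Visible pixels per second for a progressive video stream.

    Hypothetical helper for illustration; blanking intervals are ignored.
    """
    return width * height * frame_rate_hz


# Full-HD 1080p@60, as stated in the abstract.
rate = visible_pixel_rate_hz(1920, 1080, 60)
print(f"{rate / 1e6:.1f} MHz")  # 124.4 MHz
```

A fully pipelined design that accepts one pixel per clock cycle therefore needs to sustain at least this rate on its visible-pixel path.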
Manuscript received March 25, 2014; revised June 11, 2014; accepted July 15, 2014. This research was supported by The Scientific and Technological Research Council of Turkey (TÜBİTAK) under project number 108E023. This paper was recommended by Associate Editor M. Frasca.
N. Yildiz and M. Alpay are with the Department of Electronics and Communications Engineering, Yildiz Technical University, 34220 Esenler, Istanbul, Turkey (e-mail: nerhuny@yildiz.edu.tr; ecesur@yildiz.edu.tr; malpay@yildiz.edu.tr).
E. Cesur was with the Department of Electronics and Communications Engineering, Yildiz Technical University, 34220, Esenler, Istanbul, Turkey. He is now with the Applied DSP and VLSI Research Group, University of Westminster, W1W 6UW, London, U.K.
K. Kayaer is with the Scientific and Technological Research Council of Turkey, 41470 Gebze, Kocaeli, Turkey (e-mail: kamerkayaer@gmail.com).
V. Tavsanoglu is with the Department of Electrical and Electronics Engineering, Isik University, 34398 Maslak, Istanbul, Turkey (e-mail: vtavsanoglu@isik.edu.tr).
Digital Object Identifier 10.1109/TCSI.2014.2345502

A continuous-time CNN (CT CNN) implementation has many advantages: a continuous-time circuit is by nature a fully parallel structure, whose convergence rate is generally much faster than that of a digital approximation. Furthermore, it is easier to combine the architecture with an imaging sensor and obtain a focal plane processor that directly processes the captured data and can be used as a pre-processor or artificial retina. However, the highest number of cells implemented in a CT CNN processor to date is 176 × 144 [3]; hence, even a low-resolution input comparable to QVGA (320 × 240) can only be processed by tiling, i.e., dividing the image into smaller overlapping "tiles" and processing them individually. Furthermore, tiling is not always reliable for some CNN templates; hence, for large images these networks can only be simulated or emulated on a digital platform.
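The tiling idea described above can be sketched as follows. This is a minimal illustration, not the scheme used by any particular CT CNN chip: the function names and the 16-pixel overlap are assumptions, since the text does not specify how much overlap is needed (in practice it depends on the template and on how far its effects propagate). Only the 176 × 144 array size and the 320 × 240 image size come from the text.

```python
def tile_starts(length, tile, overlap):
    """Start offsets of tiles of size `tile` that cover [0, length).

    Neighbouring tiles share at least `overlap` pixels (assumed value).
    The final tile is placed flush with the far edge.
    """
    if tile >= length:
        return [0]
    stride = tile - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than the tile size")
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)
    return starts


def tiles(width, height, tile_w=176, tile_h=144, overlap=16):
    """Return (x0, y0, x1, y1) boxes of overlapping tiles, row by row."""
    boxes = []
    for y0 in tile_starts(height, tile_h, overlap):
        for x0 in tile_starts(width, tile_w, overlap):
            boxes.append((x0, y0, min(x0 + tile_w, width), min(y0 + tile_h, height)))
    return boxes


# QVGA input on a 176 x 144 cell array.
for box in tiles(320, 240):
    print(box)
```

Under these assumptions a QVGA frame already needs four passes through the array, which illustrates why tiling scales poorly toward full-HD inputs even before its reliability for a given template is considered.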
Second, the bit depth of a CT CNN is limited to 7 bits due to the electrical noise and crosstalk of an analog implementation. Consequently, even a regular 256-level gray-scale result cannot be obtained with a CT CNN. Finally, as opposed to a digital implementation, modifying an analog IC design requires such extensive work that it can almost be considered a new project. As a result, digital implementations of CNN are preferable in most cases.

The difference equation of the discrete-time CNN (DT CNN) is obtained by discretizing the differential equation of the CT CNN. The difference equation may then be solved on a software platform such as a PC, DSP, or GPU, or custom hardware can be implemented either on an FPGA device or as an ASIC. Software solutions are easier to design and modify, while hardware implementations provide high performance.

Using an FPGA device for a DT CNN implementation is preferable in most cases, as it has flexible parallel structures, is faster than software implementations, and is cheaper than ASIC solutions. Consequently, the most notable DT CNN implementations [4], [5] are realized on FPGA devices. An alternative FPGA architecture of DT CNN was proposed in [6], which is named the real-time CNN processor (RTCNNP, RTCNNP-v1). The architecture proposed in this paper is a second-generation RTCNNP called RTCNNP-v2 [7], [8]. The aim of this work is to design a real-time DT CNN implementation supporting not only higher frame rates but also higher resolutions, including full-HD 1080p.

II. MATHEMATICAL OVERVIEW

An $r$-neighborhood space-invariant CT CNN with a rectangular array of cells is completely described [2] by the cell-state and output equation pair

$$\dot{x}_{i,j}(t) = -x_{i,j}(t) + \sum_{k=-r}^{r} \sum_{l=-r}^{r} \left( a_{k,l}\, y_{i+k,j+l}(t) + b_{k,l}\, u_{i+k,j+l} \right) + z \tag{1}$$

$$y_{i,j}(t) = \frac{1}{2} \left( \left| x_{i,j}(t) + 1 \right| - \left| x_{i,j}(t) - 1 \right| \right) \tag{2}$$

where $i$, $j$, $k$, $l$ are the spatial Cartesian coordinates, $x_{i,j}(t)$ is the cell state at time $t$, $u_{i,j}$ is the cell input, $a_{k,l}$ and $b_{k,l}$ are the feedback and input coefficients, respectively, and $z$ is the

1549-8328 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.