PRZEGLĄD ELEKTROTECHNICZNY (Electrical Review), ISSN 0033-2097, R. 88 NR 6/2012
Michał STAWORKO, Mariusz RAWSKI
Politechnika Warszawska, Instytut Telekomunikacji
Application of Modified Distributed Arithmetic Concept in FIR Filter Implementations Targeted at Heterogeneous FPGAs
Abstract. Distributed arithmetic is a very efficient method for implementing digital FIR filters in FPGA structures. In this approach the general-purpose multipliers of traditional MAC implementations are replaced by combinational LUT blocks. Since LUT blocks can be of considerable size, the quality of a digital filter implementation depends heavily on the efficiency of the logic synthesis algorithm that maps them onto FPGA resources. Because modern FPGAs have a heterogeneous structure, there is a need for quality algorithms that target these structures and for flexible architecture exploration that aids in appropriate mapping. The paper presents an application of the modified distributed arithmetic concept that allows for very efficient implementation of FIR filters in heterogeneous FPGA architectures.
Summary. Distributed arithmetic is a very efficient method of implementing FIR filters in FPGA devices. It allows costly multipliers to be replaced by lookup tables (LUTs). For high-order filters the LUTs become very large, so the quality of the filter implementation depends mainly on the quality of the decomposition of this table. The paper presents a new method of decomposing the LUTs of FIR filters, dedicated to heterogeneous reconfigurable structures. (Application of the modified distributed arithmetic method to the implementation of FIR filters in heterogeneous FPGAs).
Keywords: modified distributed arithmetic, FIR filters, heterogeneous programmable structures, logic synthesis.
Introduction
Digital Signal Processing (DSP), thanks to explosive growth in wired and wireless networks and in multimedia, represents one of the hottest areas in electronics. The applications of DSP continue to expand, driven by trends such as the increased use of video and still images and the demand for increasingly reconfigurable systems such as Software Defined Radio (SDR). Many of these applications combine the need for significant DSP processing with cost sensitivity, creating demand for high-performance, low-cost DSP solutions. In recent years digital filters have been recognized as a primary digital signal processing operation. With advances in digital technology they are rapidly replacing analogue filters, which were implemented with RLC components. Digital filters are used to modify the attributes of a signal in the time or frequency domain through a process called linear convolution. They are typically implemented as multiply-accumulate (MAC) algorithms with the use of special DSP devices [1, 2, 3]. Such devices are based on the concept of RISC processors with an architecture consisting of fast array multipliers. General-purpose DSP chips combine efficient implementations of these functions with a general-purpose microprocessor. The number of multipliers is generally in the range of one to four, and the microprocessor sequences data to pass it through the multiply and other functions, storing intermediate results in memory or accumulators. Performance is increased primarily by increasing the clock speed used for multiplication. With a pipelined architecture, the speed of such an implementation is limited by the speed of the array multiplier. Typical clock speeds run from tens of MHz to 1 GHz. Performance, measured in millions of multiply-and-accumulate (MAC) operations per second, typically ranges from 10 to 4000.
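The MAC formulation of linear convolution mentioned above can be illustrated with a minimal Python sketch. It is only a software model of a direct-form FIR filter, not the implementation discussed in the paper; the function name `fir_mac` and the integer sample and coefficient values are assumptions made for the example.

```python
def fir_mac(samples, coeffs):
    """Direct-form FIR filter: one multiply-accumulate per tap per output sample.

    Hypothetical helper for illustration; inputs are plain Python numbers.
    """
    delay_line = [0] * len(coeffs)      # most recent sample is at index 0
    outputs = []
    for x in samples:
        # Shift the new sample into the delay line, dropping the oldest one.
        delay_line = [x] + delay_line[:-1]
        acc = 0
        for c, d in zip(coeffs, delay_line):
            acc += c * d                # the MAC operation
        outputs.append(acc)
    return outputs


if __name__ == "__main__":
    # A unit impulse reproduces the coefficients (the impulse response).
    print(fir_mac([1, 0, 0, 0], [1, 2, 3]))
```

Each output sample costs as many multiplications as the filter has taps, which is why the multiplier dominates the speed of a MAC-based implementation.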
Field Programmable Gate Arrays (FPGAs), with their newly acquired digital signal processing capabilities, are now expanding their role to help offload computationally intensive digital signal processing functions from the processor. Progress in the development of programmable architectures in recent years has resulted in digital devices that allow very complex digital circuits and systems to be built at relatively low cost in a single programmable structure. Programmable technology also makes it possible to increase the performance of a digital system by exploiting the parallelism of the implemented algorithms, and it allows the application of special techniques such as distributed arithmetic (DA) [4, 5]. Distributed arithmetic is an important technique for implementing digital signal processing (DSP) functions in FPGAs [1]. It provides an approach to multiplier-less implementation of DSP systems: multiplication is performed with a lookup table (LUT) that stores precomputed values which can be read out easily. This makes DA-based computation well suited for FPGA realization, because the LUT is the basic component of an FPGA. DA specifically targets the sum-of-products computation that is found in many of the important DSP filtering and frequency transforming functions. The DA concept proves to be a powerful technique for implementing a MAC unit as a multiplier-less algorithm. The efficiency of implementations based on this concept and targeted at FPGAs strongly depends on the implementation of the DA-LUT. These blocks have to be efficiently mapped onto the FPGA's logic resources. The major disadvantage of the DA technique is that the size of the DA-LUT increases exponentially with the number of inputs. Several efforts have been made to reduce the DA-LUT size for efficient realization of DA-based designs. In [2], offset-binary coding is proposed to reduce the DA-LUT size by a factor of 2.
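The basic DA idea described above can be sketched in Python as follows. This is a plain bit-serial software model under stated assumptions (two's-complement integer inputs of a fixed word width, integer coefficients); it is not the modified DA architecture of the paper, and the names `build_da_lut` and `da_inner_product` are hypothetical.

```python
def build_da_lut(coeffs):
    """Precompute, for every N-bit address, the sum of the selected coefficients.

    Bit k of the address selects coeffs[k]; the table has 2**N entries.
    """
    n = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << n)]


def da_inner_product(lut, xs, width):
    """Multiplier-less sum of products, processing one bit plane per step.

    Assumes every x in xs fits in `width`-bit two's complement.
    """
    acc = 0
    for b in range(width):
        # Bit b of every input forms the LUT address for this step.
        addr = 0
        for k, x in enumerate(xs):
            addr |= ((x >> b) & 1) << k
        term = lut[addr] << b           # a shift replaces the multiplication
        if b == width - 1:
            acc -= term                 # the sign bit carries negative weight
        else:
            acc += term
    return acc


if __name__ == "__main__":
    coeffs = [3, -1, 4, 2]
    xs = [5, -7, 2, -8]                 # all fit in 4-bit two's complement
    lut = build_da_lut(coeffs)
    print(len(lut), da_inner_product(lut, xs, 4))
```

The table length `2**N` makes the exponential growth of the DA-LUT with the number of inputs explicit: four taps need 16 entries, while sixteen taps would already need 65536.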
Recently, a new DA-LUT architecture for high-speed high-order FIR filters has been introduced in [6], where the major disadvantage of such filters is eliminated by using a carry lookahead adder and a tri-state buffer. In addition, several structures have been introduced for the efficient realization of FIR filters. Novel one- and two-dimensional systolic structures have been designed for the computation of circular convolution using DA [7]; these structures involve significantly less area-delay complexity than other existing DA-based structures for circular convolution. In [8] a modified DA architecture is used to obtain an area-time-power-efficient implementation of an FIR filter in an FPGA. With the rapid growth of the FPGA industry, heterogeneous logic blocks are often used in actual FPGA architectures such as the Xilinx Virtex-5 and Altera Stratix III series. How to handle this kind of heterogeneous design network so as to generate LUTs with different input sizes during mapping is a very important and practical problem. The existing CAD tools are not well suited to utilize all possibilities that modern heterogeneous programmable