Journal of Signal Processing Systems
https://doi.org/10.1007/s11265-018-1387-2

Parallel Memory Accessing for FFT Architectures

V. Kitsakis 1 · K. Nakos 1 · D. Reisis 1 · N. Vlassopoulos 1

Received: 1 December 2017 / Revised: 8 March 2018 / Accepted: 30 May 2018
© Springer Science+Business Media, LLC, part of Springer Nature 2018

Abstract
The current paper introduces an efficient technique for parallel data addressing in FFT architectures performing in-place computations. The novel addressing organization provides parallel load and store of the data involved in radix-r butterfly computations and leads to an efficient architecture when r is a power of 2. The addressing scheme is based on a permutation of the FFT data, which leads to the improvement of the address generating circuit and the butterfly processor control. Moreover, the proposed technique is suitable for mixed radix applications, especially for radixes that are powers of 2 and straightforward continuous flow implementation. The paper presents the technique and the resulting FFT architecture and shows the advantages of the architecture compared to hitherto published results. The implementations on a Xilinx FPGA Virtex-7 VC707 of the in-place radix-8 FFT architectures with input sizes 64 and 512 complex points validate the results.

Keywords FFT · Parallel memory access · In-place architecture · FPGA implementation

1 Introduction
The evolving applications in the areas of signal processing and telecommunications demand FFT computations performed at high speed with minimal resources. FFT architectures targeting low cost implementations include a radix-b butterfly processor and a memory storing the N input points, which by the use of in-place techniques stores also the results of the FFT intermediate and output stages.
Speeding up the computations can be achieved by including b memory banks and an addressing scheme that loads and stores in parallel the b input and the b output data of each radix-b butterfly computation [1–4]. The architecture becomes more efficient by minimizing the cost of the circuits generating and routing the addresses of the data fetched in parallel and also the cost of the circuits generating the related twiddles.

D. Reisis dreisis@phys.uoa.gr
V. Kitsakis bkits@phys.uoa.gr
K. Nakos knakos@phys.uoa.gr
N. Vlassopoulos nvlassop@phys.uoa.gr
1 Department of Physics, Electronics Laboratory, National and Kapodistrian University of Athens, Physics Building IV, Panepistimiopolis, 157-84 Athens, Greece

Parallelizing the load and store operations of the butterfly data has been studied in [3–7, 9, 10, 12, 13]. The author of [3] gave a solution for radix-b FFT computations, which includes b memory banks and performs an initial data distribution in the b banks with a complex address generation circuit. The solution for radix-2 presented in [4] uses output registers to resolve the conflict that arises while storing the results of the butterfly. Reisis and Vlassopoulos [7] showed the set of permutations that provide a solution to the parallel access in N-point FFT computations with radix-b and b banks and require log b 2 N bit LUTs for realizing the permutations. A technique based on the stride permutation is presented in [6] without proof; its address generation complexity is equal to that in [3]. Related work proving that the stride permutation can be used to minimize the number of adders required in the address generation for streaming applications, including the FFT, is presented in [11]. Techniques for radix-2 FFTs are reported in [5, 8]: the work in [5] shows a heuristic approach, and [8] introduces a parallel addressing scheme exploiting the Gray code properties. The authors of [12] present an improved architecture for the case of real-valued FFTs.
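To make the parallel access requirement concrete, the following Python sketch checks a bank assignment for conflicts in an in-place radix-b FFT with b memory banks. It assumes the classic digit-sum mapping (bank = sum of the base-b digits of the data index, modulo b), chosen here only as a well-known illustration and not as the permutation proposed in this paper. The function names `digits`, `bank_of`, `butterfly_groups` and `is_conflict_free` are hypothetical.

```python
def digits(index, radix, num_digits):
    """Return the base-`radix` digits of `index`, least significant first."""
    out = []
    for _ in range(num_digits):
        out.append(index % radix)
        index //= radix
    return out


def bank_of(index, radix, num_digits):
    """Assumed mapping (illustrative only): digit sum modulo `radix`."""
    return sum(digits(index, radix, num_digits)) % radix


def butterfly_groups(radix, num_digits, stage):
    """Yield the `radix` operand indices of each butterfly at `stage`.

    In an in-place radix-b FFT the operands of one butterfly differ
    only in base-b digit number `stage`.
    """
    n = radix ** num_digits
    stride = radix ** stage
    for base in range(n):
        if (base // stride) % radix == 0:  # digit `stage` of base is zero
            yield [base + k * stride for k in range(radix)]


def is_conflict_free(radix, num_digits):
    """True if every butterfly at every stage touches each bank exactly once."""
    for stage in range(num_digits):
        for group in butterfly_groups(radix, num_digits, stage):
            banks = {bank_of(i, radix, num_digits) for i in group}
            if len(banks) != radix:
                return False
    return True


if __name__ == "__main__":
    # Radix-8 with 64 and 512 points, the sizes named in the abstract.
    print(is_conflict_free(8, 2), is_conflict_free(8, 3))
```

The check succeeds because the b operands of a butterfly take all b values in one digit while the other digits stay fixed, so their digit sums cover every residue modulo b. The cost that the schemes discussed above aim to reduce lies in generating and routing such addresses in hardware, which this sketch does not model.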