Voltage Scaling for Partitioned Systolic Array in a Reconfigurable Platform

Rourab Paul 1, Sreetama Sarkar 2, Suman Sau 3, Koushik Chakraborty 4, Sanghamitra Roy 4 and Amlan Chakrabarti 5
1 Computer Science & Engineering, Siksha 'O' Anusandhan, Odisha, India
2 Electrical and Computer Engineering, Technical University Munich, Germany
3 Computer Science & Information Technology, Siksha 'O' Anusandhan, Odisha, India
4 Dept. of Electrical and Computer Engineering, Utah State University, Logan, USA
5 School of IT, University of Calcutta, Kolkata, India
rourabpaul@soa.ac.in

Abstract—The rapid emergence of the Field Programmable Gate Array (FPGA) has accelerated research on hardware implementations of Deep Neural Networks (DNNs). Among DNN processors, domain-specific architectures such as Google's Tensor Processing Unit (TPU) have outperformed conventional GPUs. However, implementations of TPUs in reconfigurable hardware should emphasize energy savings to serve green-computing requirements. Voltage scaling, a popular approach to energy saving, can be critical in FPGAs because it may cause timing failures if not applied appropriately. In this work, we present an ultra-low-power FPGA implementation of a TPU for edge applications. We divide the systolic array of the TPU into different FPGA partitions, where each partition uses a different near-threshold (NTC) biasing voltage to run its FPGA cores. The biasing voltage for each partition is first estimated roughly by the proposed offline schemes; further calibration of the biasing voltage is done by the proposed online scheme. Four clustering algorithms based on the slack values of the design paths are studied for partitioning the FPGA. To overcome the timing failures caused by NTC operation, paths with higher slack are placed in lower-voltage partitions and paths with lower slack are placed in higher-voltage partitions. The proposed architecture is simulated on an Artix-7 FPGA using the Vivado design suite and a Python tool.
The simulation results substantiate the implementation of a voltage-scaled TPU in FPGAs and also justify its power efficiency.

Index Terms—FPGA partition, Low Power, TPU, Voltage Scaling

I. INTRODUCTION

The configurable logic blocks (CLBs) and switch matrices of FPGAs are power hungry, which makes FPGAs energy inefficient compared to ASICs. Recently, many researchers [1], [2] have reported CPU-FPGA based hybrid data-center architectures which provide hardware-acceleration facilities for DNNs. Despite this power inefficiency, FPGAs have become popular in cloud-scale acceleration architectures due to their specialized hardware and the economic benefits of homogeneity. Therefore, reducing FPGA power for DNN applications has become a very relevant research topic. Article [3] has studied timing failure versus biasing voltage for DNN implementations in FPGA. The authors underscaled the biasing voltage Vccint of the entire FPGA to increase the power efficiency of a Convolutional Neural Network (CNN) accelerator by a factor of 3. A single Vccint for the entire FPGA might not be the most power-efficient solution: partitioning the FPGA according to path slacks and feeding different biasing voltages to different partitions can reduce the power of CNN implementations further. In [4], the authors implemented a systolic array in ASIC using near-threshold (NTC) biasing voltages, which can predict the timing failures of the multiplier-accumulators (MACs) placed inside the systolic array of a TPU. The prediction of timing failure is based on the Razor flip-flop [5]. Higher fluctuation of the input bits increases the possibility of timing failure under NTC conditions. In [4], once the timing failure of a MAC is predicted by its internal Razor flip-flop, the biasing voltage of that MAC is boosted. Targeting FPGA-based DNN applications [1], our work investigates voltage-scaling techniques for the TPU on the FPGA platform.
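To see why voltage underscaling is so effective, recall the standard CMOS dynamic-power model P_dyn = alpha * C * V^2 * f: dynamic power falls quadratically with the supply voltage. The sketch below illustrates this relation; the nominal and near-threshold voltage values and the activity/capacitance/frequency parameters are illustrative assumptions, not measured FPGA data from the paper.

```python
# Back-of-the-envelope dynamic-power estimate for voltage underscaling.
# P_dyn = alpha * C * V^2 * f (standard CMOS dynamic-power model).

def dynamic_power(v, alpha=0.1, c=1e-9, f=100e6):
    """Dynamic power in watts for switching activity alpha,
    switched capacitance c (farads) and clock frequency f (hertz)."""
    return alpha * c * v ** 2 * f

v_nominal = 1.00   # assumed nominal Vccint (volts)
v_ntc = 0.60       # assumed near-threshold operating point (volts)

p_nom = dynamic_power(v_nominal)
p_ntc = dynamic_power(v_ntc)
print(f"power ratio: {p_ntc / p_nom:.2f}")  # (0.60/1.00)^2 = 0.36
```

Even before accounting for leakage, lowering the core voltage from an assumed 1.00 V to 0.60 V cuts dynamic power to roughly a third, which is consistent in spirit with the factor-of-3 efficiency gain reported in [3].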
A different Vccint for each MAC in a systolic array would be impractical in an FPGA; therefore, this work partitions the FPGA floor according to the slack values of the MAC paths. Each partition consists of a group of paths, within the MACs, having similar slacks, and each partition is connected to a different Vccint. The proposed methodology extracts the synthesis timing report from the Vivado tool. In a synthesized design, the Vivado IDE timing engine estimates the net delays of paths based on connectivity and fanout. The clustering algorithms create clusters, or groups, based on path delays. Clusters with higher delays, and hence lower slack, are placed in FPGA partitions with higher Vccint, and clusters with lower delays, and hence higher slack, are placed in FPGA partitions with lower Vccint. Here each Vccint powers an FPGA core. The tuning of Vccint against slack is done by a unique offline-online strategy. The circuit-level challenges of implementing voltage scaling in the FPGA platform are beyond the scope of this article. However, the feasibility of implementing the necessary hardware for voltage-scaling support is evident from successful implementations in ASIC technologies. As such support is unavailable in current FPGAs, we have simulated the design to validate the claim. The contributions of the paper are as follows: This paper proposes a new CAD flow to create voltage

arXiv:2102.06888v1 [cs.AR] 13 Feb 2021
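The slack-driven partitioning step described above can be sketched as follows. This is a minimal illustration, not the paper's actual CAD flow: a simple equal-size binning stands in for the four clustering algorithms, and the path names, slack values, and voltage levels are all assumed for the example.

```python
# Sketch of slack-driven partitioning: paths from the timing report are
# ordered by slack and grouped into bins, and each bin is mapped to an
# FPGA partition with its own Vccint. All inputs here are hypothetical.

def partition_by_slack(path_slacks, voltage_levels):
    """Assign each timing path to a voltage partition.

    path_slacks    : dict mapping path name -> slack (ns)
    voltage_levels : list of Vccint values, highest first; paths with
                     the least slack go to the highest-voltage partition.
    Returns a dict mapping Vccint -> list of path names.
    """
    ordered = sorted(path_slacks, key=path_slacks.get)  # tightest slack first
    k = len(voltage_levels)
    chunk = -(-len(ordered) // k)                       # ceiling division
    return {v: ordered[i * chunk:(i + 1) * chunk]
            for i, v in enumerate(voltage_levels)}

# Hypothetical slacks (ns) for four MAC paths, and two voltage islands.
slacks = {"mac0/p": 0.4, "mac1/p": 2.1, "mac2/p": 1.0, "mac3/p": 3.5}
parts = partition_by_slack(slacks, voltage_levels=[1.00, 0.80])
print(parts)  # low-slack paths under 1.00 V, high-slack paths under 0.80 V
```

In the real flow the slack values would come from the Vivado timing report, and a proper clustering algorithm (rather than fixed-size bins) would decide the group boundaries.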