A Run-Time Modulo Scheduling by using a Binary Translation Mechanism

Ricardo Ferreira, Waldir Denver
Departamento de Informatica, UFV, Vicosa, Brazil
ricardo@ufv.br

Monica Pereira
Departamento de Informatica e Matematica Aplicada, UFRN, Natal/RN, Brazil
monicapereira@dimap.ufrn.br

Jorge Quadros, Luigi Carro
Instituto de Informatica, UFRGS, Porto Alegre, Brazil
carro@inf.ufrgs.br

Stephan Wong
Computer Engineering Lab., TU Delft, Delft, Netherlands
J.S.S.M.Wong@tudelft.nl

Abstract—It is well known that innermost loop optimizations have a large effect on the total execution time. Although CGRAs are widely used for this type of optimization, their use at run-time has been limited by the overheads introduced by application analysis, code transformation, and reconfiguration, steps that are normally performed at compile time. In this work, we present the first dynamic translation technique for the modulo scheduling approach that can convert binary code on-the-fly to run on a CGRA. The proposed mechanism ensures software compatibility, as it supports different source ISAs. As a proof of concept of scaling, a change in the memory bandwidth has been evaluated (from one memory access per cycle to two memory accesses per cycle). Moreover, a comparison to state-of-the-art static compiler-based approaches for inner-loop accelerators has been performed by using CGRA and VLIW as target architectures. Additionally, to measure area and performance, the proposed CGRA was prototyped on an FPGA. The area comparisons show that a crossbar CGRA with 16 processing elements is 1.9x larger than a 4-issue and 1.3x smaller than an 8-issue VLIW softcore processor. In addition, it reaches overall speedup factors of 2.17x and 2.0x in comparison to the 4- and 8-issue processors, respectively. Our results also demonstrate that the run-time algorithm can reach a near-optimal ILP rate, better than an off-line compiler approach for an n-issue VLIW processor.

I. INTRODUCTION

The ever-increasing complexity of embedded system applications and the demand for combining many functionalities in a single system have increased the need for systems able to efficiently execute applications with heterogeneous behavior [1]. To execute these applications efficiently, it is necessary to identify (at run-time) the particular behavior of each application and to use this information as a mechanism to improve performance. In this paper, we focus on run-time techniques and reconfigurable architectures to support inner-loop processing. Moreover, the proposed run-time approach is based on binary translation mechanisms, and it could be extended to handle other application behaviors.

Nowadays, a large amount of streaming data is produced, mostly by sensor, telecommunication, and multimedia applications. These applications are in general implemented using intensive loops. In addition, systems with different processing capabilities, ranging from embedded to exascale computing, require efficiency in terms of performance and power (Gops/W). Coarse-Grained Reconfigurable Architectures (CGRAs) have shown that they can provide both power efficiency and hardware acceleration [2].

In past years, many solutions emerged in an attempt to increase loop performance by using Modulo Scheduling and CGRAs [2], [3], [4], [5], [6], [7], [8], [9], [10]. CGRAs are especially suitable for this, since they have a lower configuration overhead than fine-grained architectures, such as FPGAs [11]. In spite of that, all solutions found in the literature require special compilers or modifications to the application, which, in turn, precludes code reuse and software compatibility. Recent works proposed the use of binary translation as a solution to reduce the intrinsic performance overhead of CGRAs [12], [13].
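A binary translator that targets inner loops must first find them in the running instruction stream. A common run-time heuristic is to watch for taken backward branches and flag a loop as hot once its trip count passes a threshold; the sketch below illustrates this idea only (the function name, trace format, and threshold are assumptions for illustration, not details from the paper):

```python
def detect_hot_loops(trace, threshold=4):
    """trace: iterable of (pc, taken_target) pairs observed at run-time,
    where taken_target is None for non-branch instructions.
    A taken branch to target <= pc delimits a candidate innermost loop
    [target, pc]; once its count passes `threshold`, it is flagged hot
    and becomes a candidate for dataflow-graph extraction and mapping."""
    counts = {}
    hot = []
    for pc, target in trace:
        if target is not None and target <= pc:  # backward (loop) branch
            loop = (target, pc)
            counts[loop] = counts.get(loop, 0) + 1
            if counts[loop] == threshold:
                hot.append(loop)
    return hot
```

A real translator would additionally track instruction boundaries and side exits, but the backward-branch test above is the standard trigger for run-time loop acceleration.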
Binary translation converts code compiled for a source ISA to run on a different ISA, in order to ensure software compatibility between different versions, or to allow application execution on different ISAs without the need for code recompilation. Additionally, run-time binary translation does not require compiler modifications, and it may take advantage of optimizations that are not possible at compile time. Along with the possibility of optimizing the execution, run-time mechanisms are becoming essential due to the dynamic behavior of many applications, such as data-dependent computations, whose behavior may vary based on the inputs. To fulfill the requirements of code reuse and software compatibility, we propose to apply binary translation (BT) to modulo scheduling (MS) approaches. To the best of our knowledge, no previous work has defined a BT run-time modulo scheduling algorithm for CGRAs. Moreover, a large reduction in compilation time should be achieved, since compile time is the major challenge faced by previous modulo scheduling algorithms [2], [3], [4], [5], [6], [7], [8]. Recently, a low-complexity MS algorithm suitable for just-in-time (JIT) compilation was proposed in [9]. A CGRA with a crossbar network is used in [9] to reduce the complexity, instead of the mesh topologies of [14], [2], [15], [4]. Nevertheless, the MS-JIT approach assumes that the starting point for the MS is a loop dataflow graph (DFG), and therefore it requires special JIT compilers or modifications to the application (such as pragmas) to detect the loop and generate the DFG. In this work, we propose the first algorithm to detect, generate, and schedule the loop from the binary code. Another advantage of the proposed mechanism is its capability to benefit from the scaling process.
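For context, modulo scheduling overlaps loop iterations at a fixed initiation interval (II): each operation of the loop DFG is assigned a start time, and operations whose times share the same residue modulo II compete for the same processing elements (PEs). The sketch below is a minimal illustration only, assuming a crossbar CGRA (any PE can feed any other, so only PE counts constrain placement), single-cycle operations, and an acyclic loop body so that II equals the resource-constrained minimum; it is not the paper's algorithm:

```python
from graphlib import TopologicalSorter  # Python 3.9+

def res_mii(num_ops, num_pes):
    # Resource-constrained minimum II: ceil(ops / PEs).
    return -(-num_ops // num_pes)

def modulo_schedule(dfg, num_pes):
    """dfg maps each operation to the list of operations it depends on
    (assumed acyclic: no loop-carried dependences, so II = ResMII).
    Returns (ii, placement) with placement[op] = (time, pe)."""
    ii = res_mii(len(dfg), num_pes)
    used = {s: 0 for s in range(ii)}   # PEs occupied in each modulo slot
    placement = {}
    for op in TopologicalSorter(dfg).static_order():
        # earliest start: one cycle after the latest predecessor
        t = max((placement[p][0] + 1 for p in dfg[op]), default=0)
        # advance until modulo slot t % ii has a free PE
        while used[t % ii] >= num_pes:
            t += 1
        placement[op] = (t, used[t % ii])
        used[t % ii] += 1
    return ii, placement
```

Because the total number of operations never exceeds II times the PE count, the inner while loop always finds a free slot; on a crossbar there is no routing step, which is what makes the low-complexity JIT formulation of [9] possible.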
For instance, if the memory bandwidth is improved, the binary translator could use this information to

2014 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIV)