Run-Time Modulo Scheduling Using a Binary
Translation Mechanism
Ricardo Ferreira,
Waldir Denver
Departamento de Informática
UFV
Viçosa, Brazil
ricardo@ufv.br
Monica Pereira
Departamento de Informática
e Matemática Aplicada
UFRN
Natal/RN, Brazil
monicapereira@dimap.ufrn.br
Jorge Quadros,
Luigi Carro
Instituto de Informática
UFRGS
Porto Alegre, Brazil
carro@inf.ufrgs.br
Stephan Wong
Computer Engineering Lab.
TU Delft
Delft, Netherlands
J.S.S.M.Wong@tudelft.nl
Abstract—It is well known that optimizing innermost loops
has a significant effect on total execution time. Although CGRAs
are widely used for this type of optimization, their use at run-time
has been limited by the overheads introduced by application
analysis, code transformation, and reconfiguration, steps that
are normally performed at compile time. In this work, we
present the first dynamic translation technique for the modulo
scheduling approach that converts binary code on-the-fly to
run on a CGRA. The proposed mechanism ensures software
compatibility, as it supports different source ISAs. As a proof of
concept of scalability, a change in memory bandwidth has been
evaluated (from one memory access per cycle to two memory
accesses per cycle). Moreover, the approach has been compared
to state-of-the-art static compiler-based approaches for inner-loop
accelerators, using CGRA and VLIW target architectures.
Additionally, to measure area and performance, the proposed
CGRA was prototyped on an FPGA. The area comparisons show
that the crossbar CGRA (with 16 processing elements) is 1.9x larger
than a 4-issue VLIW softcore processor and 1.3x smaller than
an 8-issue one. In addition, it reaches overall
speedups of 2.17x and 2.0x over the 4-issue and
8-issue processors, respectively. Our results also demonstrate that the run-
time algorithm can reach a near-optimal ILP rate, better than
an off-line compiler approach for an n-issue VLIW processor.
I. INTRODUCTION
The ever-increasing complexity of embedded system ap-
plications and the demand for combining many functionalities
in a single system have increased the need for systems able
to efficiently execute applications with heterogeneous behav-
ior [1]. In order to efficiently execute these applications, it
is necessary to find solutions able to identify (at run-time) the
particular behavior of each application and use this information
as a mechanism to improve performance. In this paper, we
focus on run-time techniques and reconfigurable architectures
to support inner loop processing. Moreover, the proposed run-
time approach is based on binary translation mechanisms, and
it could be extended to handle other application behaviors.
Nowadays, a large amount of streaming data is produced,
mostly by sensors, telecommunication, and multi-
media applications. These applications are generally
implemented using compute-intensive loops. In addition, systems with
different processing capabilities, ranging from embedded to
exascale computing, require efficiency in terms of both performance
and power (GOPS/W). Coarse-Grained Reconfigurable Archi-
tectures (CGRAs) have shown that they can provide both
power efficiency and hardware acceleration [2].
In recent years, many solutions have emerged that attempt to
increase loop performance by combining modulo scheduling
and CGRAs [2], [3], [4], [5], [6], [7], [8], [9], [10]. CGRAs
are especially suitable for this task, since they have a lower config-
uration overhead than fine-grained architectures such as FPGAs [11].
In spite of that, all solutions found in the literature require special
compilers or modifications to the application, which, in turn,
preclude code reuse and software compatibility.
Recent works have proposed binary translation as
a solution to reduce the intrinsic performance overhead of
CGRAs [12], [13]. Binary translation converts code compiled
for a source ISA to run on a different ISA, either to
ensure software compatibility between different ISA versions or
to allow application execution on different ISAs without
code recompilation. Additionally, run-time binary
translation requires no compiler modifications, and it may
exploit optimizations that are not possible at compile
time. Along with the possibility of optimizing the execution,
run-time mechanisms are becoming essential due to the dy-
namic behavior of many applications, such as data-dependent
computation, whose behavior may vary with the inputs.
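As a minimal illustration of the translation step described above (hypothetical encodings; this is not the translator or the ISA pair used in this work), the following Python sketch decodes a 32-bit MIPS-like R-type `add` instruction and re-emits it as a generic three-address operation that a target accelerator could consume:

```python
# Minimal binary-translation sketch (illustrative only): decode a
# 32-bit MIPS-like R-type "add rd, rs, rt" word and re-emit it as a
# target-independent three-address operation.

def decode_rtype(word):
    """Extract the rs, rt, and rd register fields from an R-type word."""
    rs = (word >> 21) & 0x1F
    rt = (word >> 16) & 0x1F
    rd = (word >> 11) & 0x1F
    return rs, rt, rd

def translate_add(word):
    """Translate a source 'add' into a generic (opcode, dest, src1, src2) tuple."""
    rs, rt, rd = decode_rtype(word)
    return ("ADD", rd, rs, rt)

# add $3, $1, $2  ->  opcode 0, rs=1, rt=2, rd=3, funct 0x20
word = (1 << 21) | (2 << 16) | (3 << 11) | 0x20
print(translate_add(word))  # ('ADD', 3, 1, 2)
```

A real translator would dispatch on the opcode and function fields over the whole source ISA; the point here is only that each source instruction maps to a machine-independent operation that the scheduler can then place on the CGRA.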
To fulfill the requirements of code reuse and software
compatibility, we propose to apply binary translation (BT)
to the modulo scheduling (MS) approach. To the best
of our knowledge, no previous work has defined a
BT-based run-time modulo scheduling algorithm for
CGRAs. Moreover, a large reduction in compile time should be
achieved, since compile time is the major challenge faced by previous
modulo scheduling algorithms [2], [3], [4], [5], [6], [7], [8].
Recently, a low-complexity MS algorithm suitable for just-in-
time (JIT) compilation was proposed in [9], which uses a CGRA with
a crossbar network, instead of the mesh topologies of [14], [2], [15], [4],
to reduce complexity. Nevertheless,
MS-JIT assumes that the starting point for the
MS is a loop dataflow graph (DFG), and therefore it still requires
special JIT compilers or modifications to the application (such as
pragmas) to detect the loop and generate the DFG. In this
work, we propose the first algorithm to detect, generate, and
schedule the loop directly from binary code. Another advantage
of the proposed mechanism is its ability to benefit from
scaling: for instance, if the memory bandwidth is
improved, the binary translator can use this information to
exploit the additional bandwidth at run-time.
2014 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIV)
978-1-4799-3770-7/14/$31.00 ©2014 IEEE
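To make the modulo-scheduling terminology concrete, the sketch below computes the resource-constrained minimum initiation interval (ResMII) and greedily places each operation of an acyclic loop DFG into a modulo reservation table. It assumes unit-latency operations and a crossbar interconnect (any PE can feed any other PE); it is a simplified illustration of the technique, not the run-time algorithm proposed in this paper:

```python
# Simplified modulo-scheduling sketch (illustrative only). Assumptions:
# unit-latency operations, a crossbar CGRA so any PE can feed any other,
# and the DFG given as op -> list of predecessor ops in topological order.

def res_mii(num_ops, num_pes):
    """Resource-constrained MII: operations per PE, rounded up."""
    return -(-num_ops // num_pes)  # ceiling division

def modulo_schedule(dfg, num_pes):
    """Greedy ASAP modulo scheduling of an acyclic DFG.

    Returns (ii, schedule) where schedule maps op -> (cycle, pe).
    For acyclic DFGs the recurrence-constrained MII is 1, so MII = ResMII.
    """
    ii = res_mii(len(dfg), num_pes)
    usage = {}   # modulo slot (cycle mod ii) -> number of PEs already used
    sched = {}   # op -> (cycle, pe index)
    for op, preds in dfg.items():
        # Earliest cycle that respects data dependences (unit latency).
        t = max((sched[p][0] + 1 for p in preds), default=0)
        # If the op's modulo slot is full, delay until a slot has a free PE.
        while usage.get(t % ii, 0) >= num_pes:
            t += 1
        sched[op] = (t, usage.get(t % ii, 0))
        usage[t % ii] = usage.get(t % ii, 0) + 1
    return ii, sched

# Four ops on 2 PEs: ResMII = ceil(4/2) = 2, so a new loop
# iteration can start every 2 cycles.
dfg = {"a": [], "b": [], "c": ["a"], "d": ["b"]}
ii, sched = modulo_schedule(dfg, num_pes=2)
print(ii)  # 2
```

A real modulo scheduler must also honor loop-carried dependence cycles (RecMII) and retry with a larger II when placement fails; the crossbar assumption is what lets this sketch ignore routing between PEs, mirroring the complexity argument made for the crossbar CGRA of [9].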