Profiling Tools for Hardware/Software Partitioning of Embedded Applications

Dinesh C. Suresh, University of California, Riverside, dinesh@cs.ucr.edu
Walid A. Najjar, University of California, Riverside, najjar@cs.ucr.edu
Frank Vahid, University of California, Riverside, vahid@cs.ucr.edu
Jason R. Villarreal, University of California, Riverside, villarre@cs.ucr.edu
Greg Stitt, University of California, Riverside, gstitt@cs.ucr.edu

ABSTRACT
Loops constitute the most executed segments of programs and are therefore the best candidates for hardware/software partitioning. We present a set of profiling tools that are specifically dedicated to loop profiling and also support combined function and loop profiling. One tool relies on an instruction set simulator and can therefore be augmented with the simulation of architecture and micro-architecture features, while the other is based on compile-time instrumentation with gcc and therefore incurs very little slowdown compared to the original program. We use the profiling results to identify the compute core in each benchmark and study the effect of compile-time optimization on the distribution of cores in a program. As an example application of these tools to hardware/software partitioning, we also study the potential speedup that can be achieved using a configurable system on a chip consisting of a CPU embedded on an FPGA.

Categories and Subject Descriptors
C.3 [Performance of Systems]: Measurement techniques, Design studies – profiling techniques, hardware/software partitioning.

General Terms
Measurement, Performance, Design, Experimentation.

Keywords
Hardware/software partitioning, loop analysis, compiler optimization.

1. INTRODUCTION
Embedded software is the key contributor to embedded system performance and power consumption. Program execution tends to spend most of its time in a small fraction of the code, a feature known as the "90-10 rule": 90% of the execution time comes from 10% of the code.
By their very nature, embedded applications tend to follow the 90-10 rule even more than desktop applications. Tools seeking to optimize the performance and/or energy consumption of embedded software should therefore focus first on finding that critical code. Possible optimizations include aggressive recompilation, customized instruction synthesis, customized memory hierarchy synthesis, and hardware/software partitioning [10,2], all focusing on the critical code regions. Of those critical code regions, about 85% are inner loops, while the remaining 15% are functions. A partitioning tool should thus focus first on finding the most critical software loops and understanding their execution statistics, after which the tool should try partitioning alternatives coupled with loop transformations in hardware (such as loop unrolling). Our particular interest is in the hardware/software partitioning of programs, but our methods can be applied to the other optimization approaches too.

Many profiling tools have been developed. Some tools, like gprof, provide only function-level profiling and do not supply the more detailed information, such as loop statistics, necessary for partitioning. However, tools that profile at a more detailed level tend to focus on statements or blocks – a user interested in loops must implement additional functionality on top of those profilers. Furthermore, many profiling tools, like ATOM [12] or Spix [11], are specific to a particular microprocessor family.

Instruction-level profiling tools can be tuned to report the percentage of time spent in the different loops of a program. They can be broadly classified into two categories: compilation-based instruction profilers and simulation-based instruction profilers. A compilation-based profiler instruments the program by adding counters to various basic blocks of the program.
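As a rough illustration of this counter-based scheme (a Python sketch rather than the C-level code a real compiler pass would emit; the function and block names are hypothetical), instrumentation amounts to bumping a per-block counter at the entry of each basic block and flushing the counts to a file when the program exits:

```python
import atexit

# Hypothetical per-basic-block counters for one toy function.
bb_counts = {"loop_header": 0, "loop_body": 0, "loop_exit": 0}

def vector_sum(data):
    # Instrumented the way a compiler pass would instrument it:
    # a counter increment inserted at the entry of each basic block.
    bb_counts["loop_header"] += 1
    total = 0
    for x in data:
        bb_counts["loop_body"] += 1   # once per loop iteration
        total += x
    bb_counts["loop_exit"] += 1
    return total

def dump_counts(path="bb_counts.txt"):
    # Counter values are flushed to a separate file at program exit.
    with open(path, "w") as f:
        for block, count in sorted(bb_counts.items()):
            f.write(f"{block} {count}\n")

atexit.register(dump_counts)
```

The loop-body counter directly yields the iteration count of each loop, which is exactly the loop-level statistic a partitioning tool needs.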
During execution, the counter values are written to a separate file. A simulation-based instruction profiler uses an instruction set simulator; such profilers can be further classified as static or dynamic. Dynamic simulation-based profilers obtain the instruction profile while the code executes on the simulator, whereas in static profiling the execution is written to a trace and the trace is then processed to obtain instruction counts. For very large applications, the trace generated by static profiling can grow to unmanageable proportions. Even though dynamic profiling is slow compared to compiler-based instrumentation, a variety of architectural parameters can be tuned and studied while the program is profiled on a full-system simulator.

We have developed a profiling tool that focuses on collecting loop-level information for a very large variety of microprocessor platforms. Our profiling tool supports both the instrumentation and the simulation paradigms. We achieved this goal by building on top of two very popular tools – gcc for instrumentation and Simics [9] for simulation – while keeping the output identical for the two.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
LCTES '03, June 11-13, 2003, San Diego, California, USA.
Copyright 2003 ACM 1-58113-647-1/03/0006…$5.00.
Proc. of the 2003 ACM SIGPLAN Conf. on Languages, Compilers and Tools for Embedded Systems, San Diego, CA, June 2003.
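The static, trace-based approach described above can be sketched as a small post-processing pass (the trace format here is hypothetical: one instruction per line, with the program counter as the first field):

```python
from collections import Counter

def profile_trace(trace_lines):
    # Reduce a (possibly huge) instruction trace to per-address
    # execution counts -- the post-processing step of static,
    # simulation-based profiling.
    counts = Counter()
    for line in trace_lines:
        pc = line.split()[0]   # assume the first field is the program counter
        counts[pc] += 1
    return counts

# A toy trace: a three-instruction loop body executed three times.
trace = [
    "0x4000f0 mov",
    "0x400100 ld", "0x400104 add", "0x400108 bne",
    "0x400100 ld", "0x400104 add", "0x400108 bne",
    "0x400100 ld", "0x400104 add", "0x400108 bne",
]
counts = profile_trace(trace)   # counts["0x400100"] is 3
```

Addresses with high counts mark loop bodies; mapping them back to source loops gives the per-loop statistics, and the trace's length for a long-running application makes clear why such traces can grow to unmanageable sizes.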