IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 20, NO. 4, APRIL 2001, p. 477

Compact and Efficient Code Generation Through Program Restructuring on Limited Memory Embedded DSPs

Siddharth Rele, Vipin Jain, Santosh Pande, and J. Ramanujam

Abstract—Many embedded systems, such as digital cameras, digital radios, high-resolution printers, and cellular phones, involve heavy use of signal processing and are thus based on digital signal processors (DSPs). DSPs such as the TMS320C2x and the DSP5600x have irregular data paths that typically result from application-specific needs (such as chaining multiply–accumulate operations). Efficient code generation for such embedded DSP processors is a challenging problem. Stringent requirements such as tight memory constraints and fast response time create the need for compact and efficient code. In this paper, we address the problem of generating compact and efficient code for embedded DSP processors. Most DSP instruction set architectures (ISAs) feature intrainstruction parallelism (IIP), which enables individual operations to be executed in parallel by generating a complex instruction. A reduction in generated code size and improved performance can be achieved by exploiting the parallelism present in such ISAs. We present a code restructuring technique to fully exploit this parallelism through maximal utilization of the complex instructions present in the instruction set. We formulate this as a maximal benefit code restructuring problem, which is to derive the arrangement of statements that maximally exploits IIP without violating data dependencies. This problem is equivalent to the precedence constrained Hamiltonian path problem for directed acyclic graphs and the traveling salesman problem in general, both of which are NP-hard. In this paper, we present an optimal algorithm to solve the problem.
We have implemented this optimal algorithm in a compiler targeted to generate code for the TMS320C25 DSP. We tested our framework on a number of benchmarks and found that the performance of the generated code (measured in dynamic instruction cycle counts) improves by as much as 9.9% with an average of 4%. The average code-size reduction over code compiled without exploiting parallelism is 2.9%. We also studied the effect of loop unrolling on the available IIP. An on-chip instruction cache can be effectively utilized by unrolling loops such that the generated code fully occupies the memory. The benefit is a reduction in dynamic instruction count due to the higher number of complex instructions generated. We found that by unrolling loops four to five times to fit the available on-chip instruction cache, the dynamic instruction counts reduce by as much as 9.9%.

Index Terms—Code compaction, complex instruction ISAs, DSPs.

Manuscript received November 1, 1999; revised June 13, 2000 and December 13, 2000. The work of S. Pande was supported in part by DARPA under Contract ARMY DABT63-97-C-0029 and the National Science Foundation under Grant CCR-0073512. The work of J. Ramanujam was supported in part by a National Science Foundation Young Investigator Award CCR-9457768 and by the National Science Foundation under Grant CCR-0073800. This paper was recommended by Associate Editor R. Camposano.
S. Rele is with the Compiler Research Laboratory, Department of Electrical and Computer Engineering and Computer Science, University of Cincinnati, Cincinnati, OH 45221 USA (e-mail: srele@ececs.uc.edu).
V. Jain was with the Compiler Research Laboratory, Department of Electrical and Computer Engineering and Computer Science, University of Cincinnati, Cincinnati, OH 45221 USA. He is now with the Server Technology Division, Oracle Corporation, San Jose, CA 95101 USA.
S. Pande was with the University of Cincinnati, Cincinnati, OH 45221 USA. He is now with the College of Computing, Georgia Institute of Technology, Atlanta, GA 30318 USA (e-mail: santosh@cc.gatech.edu).
J. Ramanujam is with the Department of Electrical and Computer Engineering, Louisiana State University, Baton Rouge, LA 70803 USA (e-mail: jxr@ee.lsu.edu).
Publisher Item Identifier S 0278-0070(01)01939-X.

I. INTRODUCTION

Embedded processors are widely used in a variety of applications such as cellular phones, pagers, printers, copiers, digital cameras, automobiles, and flight navigation systems. Unlike general purpose processors, embedded processors are designed and optimized for specific (classes of) applications [13]. Embedded systems are constrained by limited on-chip program memory [2], real-time performance requirements [18], [27], and low power consumption demands. The evaluation criteria for embedded processors are different from those of general purpose processors. The following criteria are typically used when comparing embedded processors [34].

1) Performance: The cost-performance ratio of embedded systems is measured in MIPS/dollar. It is one of the most important criteria for judging embedded processors because of the real-time constraints and low cost demanded of the systems in which the processor is embedded.

2) Code Size Versus Density: Code targeted toward complex instruction set computing (CISC) architectures can be smaller than code targeted toward reduced instruction set computing (RISC) architectures due to the presence of complex instructions. However, in the absence of aggressive compiler optimizations, the code density of CISCs is poor compared to RISCs. Thus, compilers have to play a big role in improving code density by performing machine-dependent optimizations specifically designed to reduce the size of the generated code through code restructuring transformations.
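As an illustration of the maximal benefit code restructuring problem stated in the abstract, the following Python sketch searches for a statement order that respects data dependences and maximizes the number of adjacent statement pairs that could be combined into one complex instruction. It is a toy brute-force enumeration written for this illustration, not the algorithm developed in this paper. The statement names, the dependence set `deps`, and the `pairable` relation (ordered pairs assumed to be combinable when adjacent) are all hypothetical.

```python
from itertools import permutations


def respects_dependences(order, deps):
    """Return True if, for every (before, after) pair in deps,
    'before' appears earlier than 'after' in the given order."""
    position = {stmt: i for i, stmt in enumerate(order)}
    return all(position[b] < position[a] for b, a in deps)


def benefit(order, pairable):
    """Count adjacent pairs that could be fused into a complex instruction.

    The scan goes left to right and skips past a fused pair, so each
    statement takes part in at most one fusion."""
    fused, i = 0, 0
    while i < len(order) - 1:
        if (order[i], order[i + 1]) in pairable:
            fused += 1
            i += 2
        else:
            i += 1
    return fused


def best_order(stmts, deps, pairable):
    """Exhaustively try every dependence-respecting order and return
    (benefit, order) for the first order with the highest benefit.
    Exponential in len(stmts); usable only on tiny examples."""
    best = None
    for order in permutations(stmts):
        if not respects_dependences(order, deps):
            continue
        score = benefit(order, pairable)
        if best is None or score > best[0]:
            best = (score, order)
    return best


# Hypothetical basic block: s3 uses a value produced by s1.
stmts = ["s1", "s2", "s3", "s4"]
deps = {("s1", "s3")}
# Hypothetical ISA property: these ordered pairs fuse when adjacent.
pairable = {("s1", "s4"), ("s2", "s3")}

print(benefit(stmts, pairable))           # original order: 1 fusion
print(best_order(stmts, deps, pairable))  # (2, ('s1', 's4', 's2', 's3'))
```

In this toy block the original order allows one fusion, while the reordered block allows two, so four statements would be emitted as two complex instructions. The exhaustive search is exponential in the number of statements, which is consistent with the NP-hardness noted in the abstract.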
0278–0070/01$10.00 © 2001 IEEE

Traditionally, embedded processors and systems have been programmed in assembly language in order to meet hard performance constraints and fit within limited program memory. However, programming large, complex applications in assembly language is tedious, error-prone, and time-consuming; in addition, such programs are difficult to maintain. High-level languages like C and C++ are replacing assembly language in embedded programming. Programming in high-level language