Software De-Pipelining Technique

Bogong Su 1, Jian Wang 2, Erh-Wen Hu 1, Joseph Manzano 1
sub@wpunj.edu, jiwang@nortelnetworks.com, hue@wpunj.edu, Josbry21@cs.com
1 Dept. of Computer Science, The William Paterson University of New Jersey, USA
2 Wireless Speech and Data Processing, Nortel Networks, Montreal, Canada

Abstract

Software pipelining is a loop optimization technique used to speed up loop execution. It is widely implemented in optimizing compilers for VLIW and superscalar processors that support instruction level parallelism. Software de-pipelining is the reverse of software pipelining; it restores the assembly code of a software-pipelined loop back to its semantically equivalent sequential form. Because heavily optimized assembly code is non-sequential in nature, it is very difficult to gain insight into the meaning of the code. Consequently, the task of de-pipelining the code of a software-pipelined loop is complex and challenging. In this paper we present our de-pipelining algorithm with a formal description, a proof, and a set of working examples. Experiments with loops taken from practical DSP programs are conducted on popular VLIW digital signal processors to verify the algorithm. Some applications of software de-pipelining are also discussed.

1. Introduction

Because of the practical importance of porting low-level code from one processor to another, decompilation has been studied for many years [1,3,5,9,14,20]. Yet few of these studies have dealt with source machines that support instruction level parallelism (ILP) [2]. Decompiling optimized code is difficult because the decompiler must de-optimize the low-level code of the source machine [14]. It is even more difficult when the source machine supports ILP and the source code has been optimized by software pipelining.

Software pipelining [6,11,15,21] is a loop optimization technique used to speed up loop execution.
It is widely implemented in optimizing compilers for VLIW and superscalar processors [8,13] that support instruction level parallelism, such as IA-64, Texas Instruments' C6X, and StarCore’s SC140 DSP. Software de-pipelining (de-pipelining hereafter) [16] is the reverse of software pipelining; it restores the assembly code of a software-pipelined loop back to its semantically equivalent sequential form.

The motivation for our study of de-pipelining is as follows. First, because the original sequential code has been transformed, the code of a software-pipelined loop is very difficult to comprehend, analyze, and debug. This is especially true when the source machine has a large branch delay and/or when the compiler uses sophisticated optimization techniques such as prelude and postlude collapsing [7]. As an example, Figure 1.1 shows the assembly code segment of a software-pipelined loop optimized with both prelude and postlude collapsing for Texas Instruments' C62 (TIC62 hereafter) processor. The “||” symbol in the code segment means that the instruction on the current line is executed in parallel with the instruction on the previous line; the set of instructions executed in parallel is referred to as an instruction group in this paper. Because TIC62 has a long branch delay (6 clock cycles) and its compiler performs prelude and postlude collapsing on the software-pipelined loop in order to reduce code size, the instructions in the code segment have been transformed so heavily that it is very difficult to comprehend the meaning of the code or to determine whether the segment is a software-pipelined loop at all, let alone to identify the body, the prelude, and the postlude of the loop.

              MVK   57, A1
       [A1]   SUB   A1,1,A1
||            ZERO  A7
||            ZERO  B7
       [A1]   SUB   A1,1,A1
||     [A1]   B     LOOP
||            ZERO  A6
||            ZERO  B6
       [A1]   SUB   A1,1,A1
||     [A1]   B     LOOP
||            ZERO  A2
||            ZERO  B2
       [A1]   SUB   A1,1,A1
||     [A1]   B     LOOP
       [A1]   SUB   A1,1,A1
||     [A1]   B     LOOP
       [A1]   SUB   A1,1,A1
||     [A1]   B     LOOP
LOOP:
              LDW   *A4++,A2
||            LDW   *B4++,B2
||     [A1]   SUB   A1,1,A1
||     [A1]   B     LOOP
||            MPY   A2,B2,A6
||            MPYH  A2,B2,B6
||            ADD   A6,A7,A7
||            ADD   B6,B7,B7
              ADD   A7,B7,A4

Figure 1.1 An assembly code segment of TIC62
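
To make the notion of an instruction group concrete, the following Python sketch splits the code of Figure 1.1 into groups using only the "||" convention described above. It illustrates the notation and is not part of the de-pipelining algorithm itself. The function name split_instruction_groups, the one-instruction-per-line input layout, and the treatment of a line ending in ":" as a label are assumptions made for this example.

```python
# Illustrative sketch only: groups TIC62-style assembly lines into
# instruction groups. All names here are hypothetical, not from the paper.

FIGURE_1_1 = """
MVK 57, A1
[A1] SUB A1,1,A1
|| ZERO A7
|| ZERO B7
[A1] SUB A1,1,A1
|| [A1] B LOOP
|| ZERO A6
|| ZERO B6
[A1] SUB A1,1,A1
|| [A1] B LOOP
|| ZERO A2
|| ZERO B2
[A1] SUB A1,1,A1
|| [A1] B LOOP
[A1] SUB A1,1,A1
|| [A1] B LOOP
[A1] SUB A1,1,A1
|| [A1] B LOOP
LOOP:
LDW *A4++,A2
|| LDW *B4++,B2
|| [A1] SUB A1,1,A1
|| [A1] B LOOP
|| MPY A2,B2,A6
|| MPYH A2,B2,B6
|| ADD A6,A7,A7
|| ADD B6,B7,B7
ADD A7,B7,A4
"""


def split_instruction_groups(asm_text):
    """Return (groups, labels).

    groups: list of instruction groups, each a list of instruction strings.
    labels: maps a label name to the index of the group that follows it.
    Assumes one instruction per line and labels on their own line.
    """
    groups = []
    labels = {}
    for raw in asm_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            # A label marks the position of the next group to be created.
            labels[line[:-1]] = len(groups)
            continue
        if line.startswith("||"):
            # "||" means: executed in parallel with the previous line.
            if not groups:
                raise ValueError("'||' with no preceding instruction")
            groups[-1].append(line[2:].strip())
        else:
            # Any other line starts a new instruction group.
            groups.append([line])
    return groups, labels


groups, labels = split_instruction_groups(FIGURE_1_1)
for index, group in enumerate(groups):
    print(index, " || ".join(group))
```

Under these assumptions, the group that follows the LOOP label holds eight instructions issued in parallel. Grouping alone does not show which of the surrounding groups belong to the prelude or the postlude.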