Static Synchronization Beyond VLIW

H. Dietz, T. Schwederski†, M. O'Keefe, and A. Zaafrani

School of Electrical Engineering
Purdue University
West Lafayette, IN 47907
hankd@ecn.purdue.edu

† T. Schwederski is currently at the Institute for Microelectronics Stuttgart, Allmandring 30 a, 7000 Stuttgart 80, FRG.

A key advantage of SIMD (Single Instruction stream, Multiple Data stream) architectures is that synchronization is effected statically at compile time, hence the execution-time cost of synchronization between "processes" is essentially zero. VLIW (Very Long Instruction Word) machines are successful in large part because they preserve this property while providing more flexibility in terms of what kinds of operations can be parallelized. In this paper, we propose a new kind of architecture, the "static barrier MIMD" or SBM, which can be viewed as a further generalization of the parallel execution abilities of static synchronization machines.

Keywords: SIMD, VLIW, LSM, SBM, DBM, MIMD, barrier-synchronization, code-scheduling, compiler-optimization.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

© 1989 ACM 089791-341-8/89/0011/0416 $1.50

1. Introduction

Barrier MIMDs are asynchronous Multiple Instruction stream Multiple Data stream architectures capable of parallel execution of loops, subprogram calls, and variable-execution-time instructions. However, instead of using barriers as a synchronization mechanism, the proposed barrier hardware is used to impose static timing constraints. Since the compiler can know at compile time all instructions which each processor could be executing when a particular conceptual synchronization operation is needed, it can resolve most synchronizations by using VLIW-like compile-time instruction scheduling, without use of a runtime synchronization mechanism. The effect is that the proposed barrier mechanism greatly extends the generality of efficient static scheduling without adding a significant hardware cost.

Traditional, directed-synchronization MIMD architectures are more flexible than barrier MIMDs, but the benefits of static scheduling make barrier MIMDs superior for fine to medium grain parallelism. Both the barrier architecture and the supporting compiler technology are discussed in this paper.

PASM is the PArtitionable Simd/Mimd system designed by H. J. Siegel et al. to incorporate up to 1024 full-featured processors [SiS81]. A 16 processing-element PASM prototype [ScN87] has been constructed and is operational within the School of Electrical Engineering at Purdue University. In Spring 1987, Dietz and Schwederski met to discuss the possibility of upgrading the PASM hardware to support a VLIW mode of execution. Surprisingly, it became apparent that although PASM could not easily support VLIW execution, it was already capable of an execution model which is neither SIMD nor MIMD, nor SIMD and MIMD alternately or in partitions, but rather something between SIMD and MIMD, and more general than VLIW.

The PASM prototype processing elements (PEs) are conventional microprocessors which normally fetch and execute their own instructions. The PE local memory is divided into two segments: MIMD space and SIMD space. When instructions are fetched from MIMD space, PEs execute asynchronously and independently. When instruction fetches are made from SIMD space, extra memory wait states are inserted until all enabled PEs have accessed SIMD space, at which time the next enqueued SIMD instruction is broadcast to all enabled PEs (the fetched value is the broadcast SIMD instruction).

The unexpected new mode is what we call SBM: Static Barrier MIMD. It is accomplished by making a data fetch, rather than an instruction fetch, from SIMD space. Hence, the overhead for invoking this type of barrier on the PASM prototype is a single memory read. The data value is ignored, but the effect is that all enabled processors are barrier synchronized and resume execution at exactly the same clock cycle. This exact synchronization after encountering a barrier also makes it possible to use VLIW-like static code scheduling techniques to reduce the number of runtime synchronizations.

Various simple benchmarks have been run using the PASM prototype in this mode [FiC87] [FiC88], albeit without taking full advantage of VLIW-like code scheduling techniques. Preliminary results have been very promising. Research into the performance of barrier hardware and compiler scheduling technology has also shown great promise, and a few initial results are presented later in this paper.

Section 2 of this paper presents an overview of a new