Combining Distributed and Shared Memory Models: Approach and Evolution of the Global Arrays Toolkit

J. Nieplocha, R.J. Harrison, M.K. Kumar, B. Palmer, V. Tipparaju, H. Trease
Pacific Northwest National Laboratory

Introduction

Both shared memory and distributed memory models have advantages and shortcomings. The shared memory model is much easier to use, but it ignores data locality and placement. Given the hierarchical nature of the memory subsystems in modern computers, this characteristic can have a negative impact on performance and scalability. Various techniques, such as code restructuring to increase data reuse and introducing blocking in data accesses, can address the problem and yield performance competitive with message passing [Singh], but at the cost of compromising ease of use. Distributed memory models such as message passing or one-sided communication offer performance and scalability, but they compromise ease of use. In this context, the message-passing model is sometimes referred to as “assembly programming for scientific computing”. The Global Arrays toolkit [GA1, GA2] attempts to offer the best features of both models. It implements a shared-memory programming model in which data locality is managed explicitly by the programmer. This management is achieved by explicit calls to functions that transfer data between a global address space (a distributed array) and local storage. In this respect, the GA model has similarities to distributed shared-memory models that provide an explicit acquire/release protocol. However, the GA model acknowledges that remote data is slower to access than local data, and it allows data locality to be explicitly specified and hence managed. The GA model exposes to the programmer the hierarchical memory of modern high-performance computer systems and, by recognizing the communication overhead for remote data transfer, promotes data reuse and locality of reference.
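The explicit get/compute/put discipline described above (in the GA C interface, calls such as NGA_Get and NGA_Put) can be illustrated with a minimal single-process sketch. The class below is a hypothetical stand-in, not the GA API itself: a NumPy array plays the role of global storage, and blocks are explicitly copied to and from private local buffers.

```python
import numpy as np

class GlobalArray:
    """Toy stand-in for a GA-style distributed array (illustration only).

    In real GA the array is partitioned across process memories and get/put
    may involve one-sided communication; here a single NumPy array plays the
    role of global storage so the access pattern can be shown in isolation."""

    def __init__(self, shape):
        # "Global" storage, accessible only through explicit get/put calls.
        self._data = np.zeros(shape)

    def get(self, lo, hi):
        # Copy a rectangular patch with inclusive bounds [lo, hi]
        # from global storage into a private local buffer.
        return self._data[lo[0]:hi[0] + 1, lo[1]:hi[1] + 1].copy()

    def put(self, lo, hi, buf):
        # Write a local buffer back into the global array.
        self._data[lo[0]:hi[0] + 1, lo[1]:hi[1] + 1] = buf

# Get a block, compute on it locally, put it back:
g_a = GlobalArray((8, 8))
block = g_a.get((2, 2), (5, 5))   # explicit transfer: global -> local
block += 1.0                      # all computation on local data
g_a.put((2, 2), (5, 5), block)    # explicit transfer: local -> global
print(g_a._data[3, 3])            # prints 1.0
```

Because all computation happens on the local buffer between the two transfers, the programmer sees exactly where remote-access cost is paid, which is the locality management the GA model makes explicit.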
This paper describes the characteristics of the Global Arrays programming model and the capabilities of the toolkit, and discusses its evolution.

The Global Arrays Approach

Virtually all scalable architectures possess non-uniform memory access characteristics that reflect their multi-level memory hierarchies. These hierarchies typically comprise processor registers, multiple levels of cache, local memory, and remote memory. In future systems, both the number of levels and the cost (in processor cycles) of accessing deeper levels can be expected to increase. It is important for programming models to address the memory hierarchy, since doing so is critical to the efficient execution of scalable applications. The two dominant programming models for MIMD concurrent computing are message passing and shared memory. A message-passing operation not only transfers data but also synchronizes sender and receiver. Asynchronous (nonblocking) send/receive operations can be used to diffuse the synchronization point, but cooperation between sender and receiver is still required. The synchronization effect is beneficial in certain classes of algorithms, such as parallel linear algebra, where data transfer usually indicates completion of some computational phase; in these algorithms, the synchronizing messages can often carry both the results and a required dependency. For other algorithms, this synchronization can be unnecessary and undesirable, and a source of performance degradation and programming complexity. Despite these programming difficulties, the message-passing paradigm’s memory model maps well to the distributed-memory architectures used in scalable MPP systems. Because the programmer must explicitly control data distribution and address data-locality issues, message-passing applications tend to execute efficiently on such systems.
However, on systems with multiple levels of remote memory, for example networks of SMP workstations or computational grids, the message-passing model’s classification of main memory as simply local or remote can be inadequate. A hybrid model that extends MPI with OpenMP attempts to address this problem, but it is very hard to use and often offers little advantage over an MPI-only approach. In the shared-memory programming model, data is located either in “private” memory (accessible only by a specific process) or in “global” memory (accessible to all processes). In shared-memory systems, global memory