Blue Gene/L programming and operating environment

J. E. Moreira, G. Almási, C. Archer, R. Bellofatto, P. Bergner, J. R. Brunheroto, M. Brutman, J. G. Castaños, P. G. Crumley, M. Gupta, T. Inglett, D. Lieber, D. Limpert, P. McCarthy, M. Megerian, M. Mendell, M. Mundy, D. Reed, R. K. Sahoo, A. Sanomiya, R. Shok, B. Smith, G. G. Stewart

With up to 65,536 compute nodes and a peak performance of more than 360 teraflops, the Blue Gene/L (BG/L) supercomputer represents a new level of massively parallel systems. The system software stack for BG/L creates a programming and operating environment that harnesses the raw power of this architecture with great effectiveness. The design and implementation of this environment followed three major principles: simplicity, performance, and familiarity. By specializing the services provided by each component of the system architecture, we were able to keep each one simple and leverage the BG/L hardware features to deliver high performance to applications. We also implemented standard programming interfaces and programming languages that greatly simplified the job of porting applications to BG/L. The effectiveness of our approach has been demonstrated by the operational success of several prototype and production machines, which have already been scaled to 16,384 nodes.

Introduction

The Blue Gene/L supercomputer has been developed by IBM in collaboration with Lawrence Livermore National Laboratory (LLNL). Several installations of Blue Gene/L (BG/L) are currently planned, both in the United States and abroad. The flagship system, with 65,536 compute nodes and more than 360 teraflops of peak computing power, will be located at LLNL in Livermore, California. With its extreme scalability and raw performance, BG/L represents a new level of massively parallel systems. The system software team faced a major challenge in designing and implementing a programming and operating environment that could harness the power of this new architecture.
In designing the BG/L system software, we followed three major principles: simplicity, performance, and familiarity. Because we targeted BG/L primarily for scientific computations, we kept the system software simple for ease of development and to enable high reliability. For example, we impose a simplifying requirement that the machine be used only in a strictly space-sharing mode: only one (parallel) job can run at a time on a BG/L partition. Furthermore, we support only one thread of execution per processor. Another major simplification is the absence of demand paging in the virtual memory system, which limits the virtual memory available per node to the physical memory size.

These simplifications lead directly to performance benefits: they allowed us to take advantage of hardware features and deliver a high-performance system with no sacrifice in stability and security. For example, strict space-sharing enables us to use efficient, user-mode communication without protection problems. It also ensures that there is always a dedicated processor behind each application-level thread, which leads to more deterministic execution and higher scalability. The simplified virtual memory system ensures that there are no page faults or translation lookaside buffer (TLB) misses during program execution on the compute nodes, leading to higher and more deterministic performance.

Finally, we developed a programming environment based on familiar languages and libraries without penalizing simplicity or performance. This adherence to language and Message Passing Interface (MPI) library standards has enabled the porting of several large scientific applications to BG/L with relatively modest effort.

©Copyright 2005 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.

IBM J. RES. & DEV. VOL. 49 NO. 2/3 MARCH/MAY 2005 J. E. MOREIRA ET AL. 367
0018-8646/05/$5.00 © 2005 IBM

Critical to our system software strategy (and to the whole project strategy) is the concept of specialization