Middleware in Modern High Performance Computing System Architectures ⋆

Christian Engelmann, Hong Ong, and Stephen L. Scott

Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831-6164, USA
{engelmannc,hongong,scottsl}@ornl.gov
http://www.fastos.org/molar

Abstract. A recent trend in modern high performance computing (HPC) system architectures employs “lean” compute nodes running a lightweight operating system (OS). Certain parts of the OS as well as other system software services are moved to service nodes in order to increase performance and scalability. This paper examines the impact of this HPC system architecture trend on HPC “middleware” software solutions, which traditionally equip HPC systems with advanced features, such as parallel and distributed programming models, appropriate system resource management mechanisms, remote application steering and user interaction techniques. Since the approach of keeping the compute node software stack small and simple is orthogonal to the middleware concept of adding missing OS features between OS and application, the role and architecture of middleware in modern HPC systems need to be revisited. The result is a paradigm shift in HPC middleware design, where single middleware services are moved to service nodes, while runtime environments (RTEs) continue to reside on compute nodes.

Key words: High Performance Computing, Middleware, Lean Compute Node, Lightweight Operating System

1 Introduction

The notion of “middleware” in networked computing systems stems from certain deficiencies of traditional networked operating systems (OSs), such as Unix and its derivatives, e.g., Linux, to seamlessly collaborate and cooperate. The concept of concurrent networked computing and its two variants, parallel and distributed computing, is based on the idea of using multiple networked computing systems collectively to achieve a common goal.
⋆ This research is sponsored by the Office of Advanced Scientific Computing Research, U.S. Department of Energy. The work was performed at the Oak Ridge National Laboratory, which is managed by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725.

While traditional OSs contain networking features, they lack parallel and distributed programming models, appropriate system resource management mechanisms, remote application steering and