Published at the 2nd Greater Chicago Area System Research Workshop, 2013

FusionFS: A Distributed File System for Large-Scale Data-Intensive Computing

Dongfang Zhao*, Chen Shou*, Zhao Zhang, Iman Sadooghi*, Xiaobing Zhou*, Tonglin Li*, Ioan Raicu*‡
* Department of Computer Science, Illinois Institute of Technology
Department of Computer Science, University of Chicago
‡ Mathematics and Computer Science Division, Argonne National Laboratory

I. INTRODUCTION

Today's science is generating datasets that are increasing exponentially in both complexity and volume, making their analysis, archival, and sharing one of the grand challenges of the 21st century. Current trends predict that exascale computing, i.e. 10^18 FLOPS, will emerge by 2019. At exascale, millions of nodes and billions of threads of execution are expected, producing a similarly large number of concurrent data accesses. The current storage architecture of high-performance computing (HPC) systems, largely unchanged for decades, is unlikely to support this level of concurrent data access. The main critique concerns the topological allocation of compute and storage resources, which are interconnected as two cliques. Even though the network between compute and storage has high bandwidth and is sufficient for compute-intensive petascale applications, it would not be adequate for data-intensive petascale computing or for the emerging exascale computing (regardless of whether it is compute- or data-intensive).

We introduce FusionFS, a distributed filesystem crafted specifically for extreme-scale HPC systems. FusionFS leverages FUSE [1] to run in user space and provides a POSIX interface, so that neither the OS kernel nor applications need any changes. Non-Volatile Memory (NVM) has been shown to offer large gains for high-performance I/O-intensive applications [3], and FusionFS follows the Gordon [4] architecture by using local NVM as node-local storage coexisting with the processors.
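Because FusionFS exposes a POSIX interface through FUSE, unmodified application code works against a FusionFS mount point exactly as it would against a local filesystem. The sketch below illustrates this with plain POSIX calls; the file path is hypothetical (on a real deployment it would fall under the FusionFS mount point), and nothing here is FusionFS-specific, which is precisely the point.

```cpp
// Sketch: ordinary POSIX I/O that would run unchanged on a FusionFS mount.
// FUSE intercepts these calls in user space; no kernel or application changes.
// (The path and contents are illustrative, not from the paper.)
#include <fcntl.h>
#include <unistd.h>
#include <string>

std::string round_trip(const char *path) {
    // Create and write the file with the raw POSIX API.
    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) return "";
    (void)write(fd, "hello", 5);
    close(fd);

    // Read it back.
    char buf[16] = {0};
    fd = open(path, O_RDONLY);
    if (fd < 0) return "";
    ssize_t n = read(fd, buf, sizeof(buf) - 1);
    close(fd);
    unlink(path);  // clean up the demo file
    return n > 0 ? std::string(buf, static_cast<size_t>(n)) : "";
}
```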
FusionFS has completely distributed metadata management, based on an implementation of a distributed hash table (ZHT [7]), to achieve scalable metadata throughput. FusionFS also delivers scalable, high I/O throughput by maximizing data locality in typical read/write data access patterns.

II. DESIGN AND IMPLEMENTATION

Figure 1 illustrates the allocation of the different node types in a typical supercomputer, the IBM BlueGene/P. A traditional parallel filesystem (e.g. GPFS) is mounted on the storage nodes. The fact that compute nodes must access the remotely connected storage nodes was not an issue for compute-intensive applications; however, this architecture seriously hampers large-scale data-intensive applications. Burst Buffer [8] alleviates the issue by elevating data from the storage nodes to the I/O nodes, which act as a persistent cache. This architecture has at least two clear advantages: (1) network latency is improved by conceptually reducing the number of hops from 2 to 1; (2) data concurrency is increased from O(100) to O(1K). Nevertheless, Burst Buffer is still "remote" storage from the perspective of the compute nodes. We propose that each compute node should actively participate in both the computation and the data I/O, illustrated as the green layer in Figure 1. This fully exploits the high-speed interconnect (e.g. 3D torus) between compute nodes and makes data locality explicit to the computation.

Fig. 1. Storage spectrum of IBM BlueGene/P

FusionFS is implemented in C/C++ and shell scripts, with the exception of two third-party libraries: Google Protocol Buffers [2] and UDT [5]. The software stack of FusionFS is shown in Figure 2. Three services (metadata, data transfer, and provenance) sit at the top of the stack; they are supported by the FusionFS Core and FusionFS Utilities, which interact with the kernel FUSE module.

Fig. 2. FusionFS software stack
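The distributed metadata scheme described above can be sketched as a simple hash-based placement: any node can compute, from a file's full path alone, which node holds that file's metadata, so no central metadata server is consulted. This is a minimal sketch of the idea, not the actual ZHT API; the FNV-1a hash and the node count are assumptions chosen for illustration.

```cpp
// Sketch of DHT-style metadata placement (ZHT-like in spirit, not its real API):
// hash the full file path and reduce it modulo the number of nodes, so every
// node independently agrees on where a file's metadata lives.
#include <cstdint>
#include <string>

// FNV-1a: a small deterministic string hash, used here as a stand-in
// for whatever hash function the real DHT employs.
uint64_t fnv1a(const std::string &s) {
    uint64_t h = 1469598103934665603ULL;
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

// Map a file path to the index of the node holding its metadata.
int metadata_server(const std::string &path, int num_nodes) {
    return static_cast<int>(fnv1a(path) % static_cast<uint64_t>(num_nodes));
}
```

Because the mapping is a pure function of the path, a metadata lookup costs a single hash plus one network hop to the owning node, which is what lets metadata throughput scale with the number of nodes.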