CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 0000; 00:1–27
Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cpe

A Convergence of Key-Value Storage Systems from Clouds to Supercomputers

Tonglin Li 1, Xiaobing Zhou 3, Ke Wang 1, Dongfang Zhao 1, Iman Sadooghi 1, Zhao Zhang 4, Ioan Raicu 1,2

1 Computer Science Department, Illinois Institute of Technology, Chicago, IL, USA
2 MCS Division, Argonne National Laboratory, Lemont, IL, USA
3 Hortonworks, Palo Alto, CA, USA
4 AMP Lab, University of California, Berkeley, CA, USA

SUMMARY

This paper presents a convergence of distributed key-value storage systems in clouds and supercomputers. It specifically presents ZHT, a zero-hop distributed key-value store that has been tuned for the requirements of high-end computing systems. ZHT aims to be a building block for future distributed systems, such as parallel and distributed file systems, distributed job management systems, and parallel programming systems. ZHT is lightweight, allows nodes to join and leave dynamically, tolerates faults through replication, is persistent and scalable, and supports unconventional operations such as append, compare-and-swap, and callback in addition to the traditional insert/lookup/remove. We have evaluated ZHT’s performance on a variety of systems: a 64-node Linux cluster, an Amazon EC2 virtual cluster of up to 96 nodes, and an IBM Blue Gene/P supercomputer with 8K nodes. We compared ZHT against other key-value stores and found that it offers superior performance for the features and portability it supports.
This paper also presents several real systems that have adopted ZHT, namely FusionFS (a distributed file system), IStore (a storage system with erasure coding), MATRIX (distributed scheduling), Slurm++ (distributed HPC job launch), and Fabriq (distributed message queue management). All of these systems have been simplified by building on key-value storage, and have been shown to outperform other leading systems, in some cases by orders of magnitude. Some of these systems are rooted in HPC systems from supercomputers, while others are rooted in clouds and ad-hoc distributed systems; through our work, we have shown how versatile key-value storage systems can be across such a variety of environments.

Copyright © 0000 John Wiley & Sons, Ltd.

Received . . .

KEY WORDS: NoSQL database, distributed key-value store, supercomputer, cloud computing

1. INTRODUCTION

Today’s science is generating datasets that are increasing exponentially in both complexity and volume, making their analysis, archival, and sharing one of the grand challenges of the 21st century [1]. As supercomputers gain parallelism at exponential rates, storage infrastructure performance is increasing at a significantly lower rate. This implies that data management and data flow between storage and compute resources are becoming the new bottleneck for large-scale applications. Support for data-intensive computing is critical to advancing modern science, as storage systems have experienced a gap between capacity and bandwidth that increased more than 10-fold over the last decade. There is an emerging need for advanced techniques to manipulate, visualize, and interpret large datasets. Many domains (e.g.