A GPU Accelerated Storage System

Abdullah Gharaibeh, Samer Al-Kiswany, Sathish Gopalakrishnan, Matei Ripeanu
Electrical and Computer Engineering Department
The University of British Columbia
Vancouver, Canada
{abdullah, samera, sathish, matei}@ece.ubc.ca

ABSTRACT
Massively multicore processors, such as Graphics Processing Units (GPUs), provide, at a comparable price, peak performance one order of magnitude higher than that of traditional CPUs. This drop in the cost of computation, like any order-of-magnitude drop in the cost per unit of performance for a class of system components, creates an opportunity to redesign systems and to explore new ways to engineer them to recalibrate the cost-to-performance relation. In this context, we focus on data storage: we explore the feasibility of harnessing the GPUs' computational power to improve the performance, reliability, or security of distributed storage systems. We present the design of a storage system prototype that uses GPU offloading to accelerate a number of computationally intensive primitives based on hashing. We evaluate the performance of this prototype under two configurations: as a content addressable storage system that facilitates online similarity detection between successive versions of the same file, and as a traditional system that uses hashing to preserve data integrity. Further, we evaluate the impact of offloading to the GPU on the performance of competing applications. Our results show that this technique can bring tangible performance gains without negatively impacting the performance of concurrently running applications. Further, this work sheds light on the use of heterogeneous multicore processors for enhancing low-level system primitives, and introduces techniques to efficiently leverage the processing power of GPUs.

Categories and Subject Descriptors
D.4.3 [Operating Systems]: File Systems Management - Distributed file systems.
D.4.8 [Operating Systems]: Performance - Measurements, Modeling and Prediction.
I.3.1 [Computer Graphics]: Hardware Architecture - Graphics processors, Parallel Processing.

General Terms
Performance, Design, Experimentation.

Keywords
Storage system design, massively-parallel processors, graphics processing units (GPUs), content addressable storage.

1. INTRODUCTION
The development of massively multicore processors has led to a rapid increase in the amount of computational power available on a single die. There are two potential approaches to making the best use of this computational capacity and the greater hardware support for concurrency: one is to design applications that are inherently parallel; the other is to enhance the functionality of applications via ancillary tasks that improve an application's behavior along dimensions such as reliability and security.

While it is possible to increase the degree of parallelism of existing applications, a significant investment is needed to refactor them. Moreover, not all applications offer sufficient scope for parallelism. Therefore, we pursue the second approach and investigate techniques to enhance applications along non-functional dimensions. Specifically, we start from the observation that a number of techniques that enhance the reliability, scalability, and/or performance of distributed storage systems (e.g., erasure coding, content addressability [1, 2], online data similarity detection [3], integrity checks, digital signatures) generate computational overheads that often hinder their use on today's commodity hardware. We consider the use of Graphics Processing Units (GPUs) to accelerate these tasks, in effect using as our experimental platform a heterogeneous massively multicore system that integrates different execution models (MIMD and SIMD) and memory management techniques (hardware vs. application-managed caches).
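To make the hashing-based primitives mentioned above concrete, the following is a minimal, illustrative sketch (not the paper's implementation, and unrelated to the HashGPU API) of how content addressability supports similarity detection: a file is split into fixed-size chunks, each chunk is named by its hash, and chunks of a new version whose hashes already exist need not be stored or transferred again. The function names and the 4 KB chunk size are assumptions for illustration.

```python
import hashlib

def chunk_hashes(data: bytes, chunk_size: int = 4096) -> list:
    """Split data into fixed-size chunks and name each by its digest
    (SHA-1 here, a common choice in content-addressable stores)."""
    return [hashlib.sha1(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]

def shared_chunks(old: bytes, new: bytes, chunk_size: int = 4096) -> int:
    """Count chunks of `new` whose content already exists in `old`;
    only the remaining chunks would need to be stored or sent."""
    seen = set(chunk_hashes(old, chunk_size))
    return sum(1 for h in chunk_hashes(new, chunk_size) if h in seen)

v1 = b"A" * 8192 + b"B" * 4096   # three 4 KB chunks
v2 = b"A" * 8192 + b"C" * 4096   # only the last chunk changed
print(shared_chunks(v1, v2))     # -> 2: two chunks are reused, one is new
```

Hashing every chunk of every write is exactly the computational overhead that motivates offloading this primitive to the GPU.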
We have previously demonstrated that GPUs can accelerate the computation of hash-based primitives [4]. This paper investigates the system-level challenges and quantifies the benefits of integrating GPU offloading into a complete storage system. To this end, we have prototyped (Section 3) a distributed storage system that integrates our HashGPU library with the MosaStore content-addressable storage system. Most of the integration challenges are addressed by CrystalGPU, a newly developed generic runtime layer that optimizes task execution on the GPU. Our experimental evaluation (Section 4) demonstrates that the proposed architecture enables significant performance improvements compared to a traditional architecture that does not offload compute-intensive primitives.

The contribution of this work is fourfold. First, we demonstrate the viability of employing massively multicore processors, GPUs in particular, to support storage system services. To this end, we evaluate, in the context of a content-addressable distributed storage system, the throughput gains enabled by offloading hash-based primitives to GPUs. We provide a set of data points that inform storage system designers' decisions on whether exploiting massively multicore processors to accelerate storage system operations is a viable approach for particular workloads and deployment environment characteristics.