Efficient In-memory Data Management: An Analysis

Hao Zhang, Bogdan Marius Tudor, Gang Chen#, Beng Chin Ooi
National University of Singapore, #Zhejiang University
{zhangh,bogdan,ooibc}@comp.nus.edu.sg, #cg@cs.zju.edu.cn

ABSTRACT

This paper analyzes the performance of three systems for in-memory data management: Memcached, Redis and the Resilient Distributed Datasets (RDD) implemented by Spark. By performing a thorough performance analysis of both analytics operations and fine-grained object operations such as set/get, we show that none of these systems handles both types of workloads efficiently. For Memcached and Redis, the CPU and I/O performance of the TCP stack are the bottlenecks, even when serving in-memory objects within a single server node. RDD does not support an efficient get operation for random objects, due to the large startup cost of the get job. Our analysis reveals a set of features that a system must support in order to achieve efficient in-memory data management.

1. OBJECTIVE AND EXPERIMENTAL METHODOLOGY

Objective. Given the explosion of Big Data analytics, it is important to understand the performance costs and limitations of existing approaches for in-memory data management. Broadly, in-memory data management covers two main roles: (i) supporting analytics operations and (ii) supporting storage and retrieval operations on arbitrary objects. This paper presents a performance study of both analytics and key-value object operations on three popular systems: Memcached [2], Redis [3] and Spark's RDD [7].

Workload setup. To test the analytics performance, we use the PageRank algorithm implemented in a Map/Reduce style. In the Map phase, we compute the rank that every web page contributes to each of its neighbors, and distribute this information to other nodes. In the Reduce phase, each node computes the new ranks for its local web pages based on the contributed ranks.
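The two phases described above can be sketched on a single machine as follows. This is a minimal illustration, not the implementation used in the experiments: the function names, the even split of a page's rank among its out-neighbors, the initial rank of 1.0, and the update rule with a damping factor of 0.85 are all assumptions, since the text does not specify them.

```python
from collections import defaultdict


def map_phase(graph, ranks):
    """Map: each page emits (neighbor, contributed rank) pairs.

    Assumption: a page splits its current rank evenly among its
    out-neighbors; pages without out-neighbors emit nothing.
    """
    for page, neighbors in graph.items():
        if not neighbors:
            continue
        share = ranks[page] / len(neighbors)
        for neighbor in neighbors:
            yield neighbor, share


def reduce_phase(graph, contributions, damping=0.85):
    """Reduce: sum the contributions per page and compute the new ranks.

    Assumption: new_rank = (1 - damping) + damping * sum(contributions).
    Only pages that appear as keys of `graph` receive a rank.
    """
    sums = defaultdict(float)
    for page, contribution in contributions:
        sums[page] += contribution
    return {page: (1.0 - damping) + damping * sums[page] for page in graph}


def pagerank(graph, iterations=10, damping=0.85):
    """Run a fixed number of Map/Reduce rounds, starting from rank 1.0."""
    ranks = {page: 1.0 for page in graph}
    for _ in range(iterations):
        ranks = reduce_phase(graph, map_phase(graph, ranks), damping)
    return ranks
```

In a distributed setting, the pairs emitted by `map_phase` are the data that must be shipped between nodes, which is why the communication path of the underlying store matters for this workload.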
Spark naturally supports Map/Reduce computations, and we use the default PageRank implementation shipped with the Spark 0.8.0 examples. RDDs are persisted in memory before we use them. We use Spark 0.8.0/Scala 2.9.3 with Java 1.7.0.

Memcached is a key-value store that only supports operations such as set/get. To implement the PageRank algorithm on top of Memcached, we implement a driver program to do the computations. The driver uses Spymemcached Client 2.10.3 [4] to connect to the Memcached servers. Specifically, a driver program is hosted inside each Memcached server node to manage its local server and communicate with remote servers. We coordinate all driver programs with a master program that directs the drivers through the map and reduce steps. The TCP protocol is used for the communication between all Memcached servers and the drivers. We use Memcached 1.4.15 compiled using gcc 4.6.3 with the default settings. Figure 1 shows the architecture of the analytics operations on top of Memcached.

Figure 1: Analytics Architecture over Memcached and Redis

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain permission prior to any use beyond those covered by the license. Contact copyright holder by emailing info@vldb.org. Articles from this volume were invited to present their results at the 40th International Conference on Very Large Data Bases, September 1st - 5th 2014, Hangzhou, China. Proceedings of the VLDB Endowment, Vol. 7, No. 10. Copyright 2014 VLDB Endowment 2150-8097/14/06.

Like Memcached, Redis is a key-value store with basic set/get operations, but it also offers a set of advanced functions such as pipelined operations, server-side scripting, and transactions. For Redis we use a setup similar to that of Memcached, as described in Figure 1; we term it Redis client-side.
The driver uses Aredis Client 1.0 [1] to connect to the servers. Unlike Memcached, Redis supports server-side scripting. Thus, the PageRank processing can be done directly by the Redis servers via Lua scripts, without relying on a driver program. We refer to this approach as Redis server-side data analytics. We use Redis 2.6.16 compiled using gcc 4.6.3 with the default settings.

The implementation of PageRank in both Memcached and Redis requires one key-value object to hold the neighborhood information of each node in the graph, and one to hold the PageRank information of each node. Spark's PageRank implementation uses two RDDs. The first RDD stores each graph edge as a key-value object, and the second stores the computed PageRanks.

The Memcached and Redis servers and the RDD workers are each configured with a cache size of 5 GB; Redis persistence is disabled during the experiments. We use the default number of threads for all the server systems (e.g. 4 threads for Memcached). To stress-test the performance of the Memcached/Redis servers, we use drivers that support multi-threaded asynchronous connections to the servers. Based on a tuning experiment, we select the thread configurations that achieve the best performance: five threads for the Memcached driver and six threads for the Redis driver.

Datasets. We run PageRank for 10 iterations using two datasets. The first dataset is the Google Web Graph dataset [5], with 875,713 nodes and 5,105,039 edges. The size of this dataset on disk is 72 MB. When loaded into a single node's memory, it takes 85 MB in Redis, 135 MB in Memcached and 3.9 GB in RDD. The second dataset is the Pokec Social Network dataset [5], consisting of 1,632,803 nodes and 30,622,564 edges. On disk this dataset takes 405 MB, and