MEMSCALE™: a Scalable Environment for Databases

Héctor Montaner, Federico Silla, Holger Fröning, and José Duato

Universitat Politècnica de València, Departament d'Informàtica de Sistemes i Computadors
Camino de Vera, s/n, 46022 Valencia, Spain. hmontaner@gap.upv.es, {fsilla,jduato}@disca.upv.es

University of Heidelberg, Computer Architecture Group
B6, 26, Building B (3rd floor), 68131 Mannheim, Germany. froening@uni-hd.de

Abstract—In this paper we propose a new memory architecture for clusters referred to as MEMSCALE. This architecture provides a distributed non-coherent shared-memory view of the memory resources present in the cluster. With this aggregation technique, a given processor can directly access any memory address located at other nodes in the cluster and, therefore, the whole memory present in the cluster can be granted to a single application. In this study we focus on in-memory databases as a memory-hungry application in order to show the possibilities of our new architecture. To prove the feasibility of our idea, a 16-node prototype cluster serves as a demonstrator. Part of the memory in each node is used to create a global memory pool of 128GB which hosts an entire database. First we show that providing more memory than usually available in a typical commodity node for a database server makes the execution of queries more than one order of magnitude faster than using regular SSD drives. After that, we go one step further and show that simultaneously accessing the database from all the nodes in the cluster converts our prototype into a powerful database server capable of beating current commercial solutions in terms of latency and throughput.

Keywords: memory architecture, cluster computer, non-coherent memory, in-memory databases

I. INTRODUCTION

Commodity computers have become the common building block for scalable high-performance computing.
As a matter of fact, 83% of the systems included in the Top500 list are cataloged as cluster computers [1]. The main reason is that clusters built from commodity computers are noticeably more cost-effective than their massively parallel processing counterparts. However, the cluster architecture partitions the system memory into isolated pieces, each one located at a different node. Communication among nodes is therefore resolved by exchanging messages, although this access to foreign memory incurs extra overhead caused by the message-handling layer (in addition to the higher latency due to the distance between nodes). Nevertheless, this paradigm is commonly used by MPI-based applications.

This innate latency in accessing remote memory through messages, together with the extra effort required from the programmer to deal with explicit messages, encourages the use of shared-memory applications where possible. However, as a processor can only directly access memory located at its own node, the habitat of a shared-memory application is restricted to a single motherboard, thus hindering its use across a cluster.

The current trend in the number of cores per socket alleviates the previous restriction in terms of computing resources: nowadays it is easy to configure a motherboard with 32 cores and, as this number is expected to grow to 80 cores in the near term, the number of execution flows hosted in a single node can be quite high. Note, however, that many shared-memory applications do not scale beyond a few tens of threads [2], either because of synchronization problems or because of imbalances in the system, such as I/O bottlenecks in some data-intensive applications.
But this situation changes with regard to memory needs, which are a harder requirement than the computing power one: a decrease in the number of available cores produces a linear increase in execution time, but a decrease in the amount of available memory produces an exponential increase in execution time. This behavior is due to the fact that secondary storage makes up for the lack of main memory, although their performance differs by several orders of magnitude. This is why memory is overprovisioned at each node in clusters, just to prevent the critical situation where an application runs out of main memory. However, most of the time this just-in-case memory remains idle (but consuming power).

This economic cost and energy inefficiency is not the only problem. As described in [3], current trends in DIMM technology predict that the amount of available memory per core will drop by 30% every two years. This means that applications will become more and more memory restricted and, thus, a remedy for the memory capacity wall seems urgent.

We proposed a solution in [4][5] to increase the memory available to an application by leveraging main memory from the other nodes in a cluster. As we explain later, this approach, called MEMSCALE™, can be seen as a memory aggregation mechanism that does not require coherency among nodes in the cluster because the global memory pool is treated as an exclusive distributed memory, that is, only one application located in one of the nodes can use this memory at a time. In this paper we apply our remote memory architecture to databases and analyze how this kind of application can benefit from a large main memory pool.

By nature, databases present an insatiable need for memory. Due to the large amount of data that these applications usually handle, tables have been traditionally stored in secondary storage such as hard disks.
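The exclusive-allocation model just described can be sketched with a small conceptual model (Python is used here only for illustration; the class and method names are hypothetical and are not part of the actual MEMSCALE system): each node contributes part of its RAM to a global pool, and because at most one application owns the pool at a time, no inter-node coherence protocol is needed.

```python
class GlobalMemoryPool:
    """Conceptual model of MEMSCALE-style exclusive distributed memory.

    Each node contributes part of its RAM; the aggregated pool is
    granted exclusively to one application at a time, which is why
    no inter-node cache-coherence protocol is required.
    All names here are illustrative, not a real API.
    """

    def __init__(self, node_contributions_gb):
        # e.g. 16 nodes donating 8 GB each -> a 128 GB pool,
        # matching the prototype size mentioned in the abstract.
        self.nodes = dict(enumerate(node_contributions_gb))
        self.owner = None  # at most one application owns the pool

    def total_gb(self):
        return sum(self.nodes.values())

    def acquire(self, app_id):
        if self.owner is not None:
            raise RuntimeError("pool already granted to another application")
        self.owner = app_id  # exclusive access granted

    def release(self, app_id):
        if self.owner != app_id:
            raise RuntimeError("only the owner may release the pool")
        self.owner = None


pool = GlobalMemoryPool([8] * 16)  # 16 nodes x 8 GB each
print(pool.total_gb())             # -> 128
pool.acquire("db-server")          # the in-memory database takes the pool
pool.release("db-server")
```

The key design point the sketch captures is that exclusivity replaces coherence: a second `acquire` while the pool is owned fails, so no two nodes ever need to reconcile conflicting cached views of the same data.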
However, as the

2011 IEEE International Conference on High Performance Computing and Communications. 978-0-7695-4538-7/11 $26.00 © 2011 IEEE. DOI 10.1109/HPCC.2011.51