Strategies for Dynamic Memory Allocation in Hybrid Architectures

Peter Bertels (peter.bertels@ugent.be), Wim Heirman (wim.heirman@ugent.be), Dirk Stroobandt (dirk.stroobandt@ugent.be)
Department of Electronics and Information Systems
Ghent University, Sint-Pietersnieuwstraat 41, 9000 Gent, Belgium

ABSTRACT
Hybrid architectures combining the strengths of general-purpose processors with application-specific hardware accelerators can lead to a significant performance improvement. Our hybrid architecture uses a Java Virtual Machine as an abstraction layer to hide the complexity of the hardware/software interface between processor and accelerator from the programmer. The data communication between the accelerator and the processor often incurs a significant cost, which can sometimes annihilate the original speedup obtained by the accelerator. This article shows how we minimise this communication cost by dynamically choosing an optimal data layout in the Java heap memory, which is distributed over both the accelerator and the processor memory. The proposed self-learning memory allocation strategy finds the optimal location for each Java object's data by means of runtime profiling. The communication cost is effectively reduced by up to 86% for the benchmarks in the DaCapo suite (51% on average).

Categories and Subject Descriptors
B.3.2 [Memory Structures]: Design Styles—Shared memory; D.3.4 [Programming Languages]: Processors—Memory management; D.4.2 [Operating Systems]: Storage Management—Distributed memories

General Terms
Algorithms, Experimentation, Performance

Keywords
Hardware acceleration, Java, Memory management

1. INTRODUCTION
Hardware accelerators or other application-specific coprocessors are used to improve the performance of computationally intensive programs.
Large speedups are achieved by exploiting massive parallelism in hot code segments [3]. Recently, the same methodology has been applied to Java programs [4]. Two main directions can be identified in this domain: acceleration of the Java Virtual Machine (JVM) itself, and acceleration of specific methods of Java programs. In the first approach, additional hardware provides specific Java functionality, such as thread scheduling and garbage collection, or the translation of bytecode to native instructions [9]. A further evolution of this idea is a Java-specific processor that natively executes bytecode [10]. In the second approach, an application-specific hardware accelerator executes in parallel with the main processor. Additionally, we consider the hardware an integral part of the JVM rather than a device controlled by the application software. An advantage of this approach is that the hardware acceleration is transparent to the Java programmer. If the hardware accelerator is reconfigurable, functionality can even be moved dynamically from the general-purpose processor to the accelerator [5]. Hardware execution is then an additional optimisation step in the just-in-time compiler, where the hardware configuration can be loaded from a library [2] or even generated on-the-fly [6].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CF'09, May 18–20, 2009, Ischia, Italy.
Copyright 2009 ACM 978-1-60558-413-3/09/05 ...$5.00.
In this article we concentrate on the latter approach because it uses the hardware only for specific functions where a significant speedup can be obtained, as has been shown in several hardware implementations [3]. Less hardware-friendly methods are left in software, in contrast to the first approach, where a significant amount of hardware resources often needs to be allocated for functionality such as memory management and scheduling. Our approach leads to a hardware-accelerated JVM, which is described in Section 2. In such a hybrid architecture, the JVM has to manage scheduling and memory allocation, also for functionality executed in hardware. Java's shared-memory model now extends to the accelerator's local memory. The JVM must be extended such that both native Java code and the accelerator can access all objects, independent of their physical location.

An important task of the JVM is the placement of objects in the distributed Java heap. Since the accelerator is usually connected through a relatively slow communication medium, remote memory accesses are costly and should be avoided as much as possible. To this end, the JVM should allocate objects in the memory region closest to the most prolific user of the data. This way, data private to a thread is always in local memory, which minimises communication overhead. A static analysis is not sufficient for solving the object placement problem, as it can only conservatively estimate which data are private to a method. More-
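The placement idea outlined above can be sketched as a small runtime profiler that counts, per allocation site, how often objects from that site are accessed by the processor versus the accelerator, and steers future allocations from that site towards the memory of the dominant accessor. This is only an illustrative sketch under our own assumptions, not the paper's implementation; the names PlacementProfiler, recordAccess and regionFor are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of a self-learning placement policy: for each
// allocation site, count accesses made by the processor and by the
// accelerator, and place future objects from that site in the memory
// region nearest to the dominant accessor.
public class PlacementProfiler {
    public enum Region { PROCESSOR_MEM, ACCELERATOR_MEM }

    // Per allocation site: [0] = processor accesses, [1] = accelerator accesses.
    private final Map<String, long[]> counts = new HashMap<>();

    // Record one access to an object allocated at the given site.
    public void recordAccess(String site, boolean fromAccelerator) {
        long[] c = counts.computeIfAbsent(site, k -> new long[2]);
        c[fromAccelerator ? 1 : 0]++;
    }

    // Choose the region for the next allocation at the given site:
    // default to processor memory until the accelerator dominates.
    public Region regionFor(String site) {
        long[] c = counts.getOrDefault(site, new long[2]);
        return c[1] > c[0] ? Region.ACCELERATOR_MEM : Region.PROCESSOR_MEM;
    }
}
```

A real JVM would key such a table on internal allocation-site identifiers rather than strings, and would have to weigh the cost of migrating already-allocated objects against the savings on future accesses.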