ACAM: Approximate Computing Based on Adaptive Associative Memory with Online Learning

Mohsen Imani†, Yeseong Kim†, Abbas Rahimi‡, Tajana Rosing†
†Computer Science and Engineering, UC San Diego, La Jolla, CA 92093, USA
‡Electrical Engineering and Computer Science, UC Berkeley, Berkeley, CA 94720, USA
{moimani, yek048, tajana}@ucsd.edu, abbas@eecs.berkeley.edu

ABSTRACT
The Internet of Things (IoT) dramatically increases the amount of data to be processed for many applications, including multimedia. Unlike in traditional computing environments, IoT workloads vary significantly over time. Thus, efficient runtime profiling is required to extract highly frequent computations and pre-store them for memory-based computing. In this paper, we propose an approximate computing technique using a low-cost adaptive associative memory, named ACAM, which utilizes runtime learning and profiling. To recognize the temporal locality of data in real-world applications, our design exploits a reinforcement learning algorithm with a least recently used (LRU) strategy to select images to be profiled; the profiler is implemented using an approximate concurrent state machine. The profiling results are then stored in ACAM for computation reuse. Since the selected images represent the observed input dataset, we can avoid redundant computations thanks to high hit rates in the associative memory. We evaluate ACAM on the recent AMD Southern Islands GPU architecture, and the experimental results show that the proposed design achieves 34.7% energy savings for image processing applications with an acceptable quality of service (i.e., PSNR > 30 dB).

Keywords
Approximate computing, Associative memory, Online learning, Non-volatile memory

1. INTRODUCTION
The move toward the Internet of Things (IoT) and big data computing significantly increases the volume of input data that recent processors must handle.
In this era, many IoT workloads will run on GPUs, either in mobile devices or in the cloud (e.g., data centers). In particular, multimedia processing, as one instance of an IoT workload, has proliferated rapidly, and meeting its performance demands in a timely manner requires acceleration on efficient, massively parallel processors [1, 2]. In addition, because of data locality, similar computations occur repeatedly, which creates an opportunity to significantly reduce the amount of computation through memory-based computing [3].

To this end, associative memories in the form of lookup tables have been exploited to reduce the number of redundant computations. A software implementation pre-stores frequent patterns in a hash table and retrieves them using a set of keys, replacing the original computations. To improve lookup performance, associative memories can be implemented in hardware using ternary content addressable memory (TCAM). However, using TCAMs for computation-with-memory [4] poses two technical challenges.

First, the system design has to account for the actual workloads, which change rapidly across contexts such as time, place, and application. Market research shows significant growth in interaction with the external environment through sensor deployments. Therefore, filling associative memories with offline data at design time cannot provide desirable hit rates [5]. Today's interactive IoT workloads need a context-aware associative memory that adapts to its environment. Runtime profiling is therefore one of the essential components for the practical deployment of associative memories on parallel processors.

Second, CMOS-based TCAMs consume very high energy for the search operation. This limits the applicability of these memories to classification and IP routing [6].
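The software lookup-table approach mentioned above (pre-storing frequent patterns in a hash table and retrieving them by key) can be illustrated with a minimal sketch. The class name `ReuseTable`, the `dont_care_bits` parameter, and the averaging kernel in the usage note are hypothetical and chosen only for illustration; masking low-order operand bits is one simple way to imitate the "don't care" cells of a TCAM row in software, and it is not taken from the ACAM design itself.

```python
from typing import Callable, Dict, Tuple


class ReuseTable:
    """Hypothetical software lookup table for approximate computation reuse."""

    def __init__(self, func: Callable[..., int], dont_care_bits: int = 0):
        self.func = func
        # Mask that clears the lowest `dont_care_bits` bits of each operand,
        # so that nearby operand values map to the same table row.
        self.mask = ~((1 << dont_care_bits) - 1)
        self.table: Dict[Tuple[int, ...], int] = {}
        self.hits = 0
        self.misses = 0

    def _key(self, operands: Tuple[int, ...]) -> Tuple[int, ...]:
        return tuple(x & self.mask for x in operands)

    def prestore(self, *operands: int) -> None:
        """Compute the result once and store it under the masked key."""
        self.table[self._key(operands)] = self.func(*operands)

    def compute(self, *operands: int) -> int:
        """Return a stored result on a hit; otherwise run the real function."""
        key = self._key(operands)
        if key in self.table:
            self.hits += 1
            return self.table[key]
        self.misses += 1
        return self.func(*operands)
```

For example, with `ReuseTable(lambda a, b: (a + b) // 2, dont_care_bits=2)` and a pre-stored row for `(100, 120)`, a later call with `(101, 122)` hits the same row and returns the stored value 110 instead of the exact result 111. This is the kind of accuracy-for-reuse trade that approximate matching implies.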
Non-volatile memories (NVMs) open a new avenue for efficient memory-based computation [7]. Resistive random access memory (ReRAM) and spin-transfer torque RAM (STT-RAM) are two kinds of dense, low-leakage NVMs, based on memristive and magnetic tunneling junction (MTJ) devices, respectively. Moreover, NVM-based TCAMs can further reduce energy consumption by applying voltage overscaling (VOS) [8] or by reducing the search switching activity [9].

In this paper, we propose a novel approximate computing framework using an adaptive associative memory, called ACAM, with the capability of learning-based runtime profiling. The proposed design also addresses the endurance and cost issues of associative memories for online learning, thus providing a robust and practical solution for a wide range of dynamic workloads on parallel processor architectures. Our design goal is to find the input data that yield the highest hit rate, use them to adaptively fill the rows of an associative memory, and thereby improve overall energy efficiency. The learning-based profiling runs in the following steps:
(i) A machine learning algorithm finds the images of interest in the input dataset based on pixel similarities. Using the proposed TD-LRU policy, the algorithm identifies the most representative data, which are likely to be used in the near future, for profiling.
(ii) We profile the selected images of interest using a low-cost approximate concurrent state machine that keeps track of the number of repeated computations. The approximate profiling is implemented using hash functions and a Bloom filter, thus enhancing energy efficiency at the expense of minimal, acceptable errors.

In the circuit-level design, to address the endurance and lifetime issues caused by frequent runtime updates, ACAM exploits a robust, high-endurance MTJ-based TCAM and memory block. In addition, we apply approximation to a selected part of the associative memory to balance the tradeoff between energy and accuracy.
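To make step (ii) more concrete, the following sketch counts repeated computations approximately with hash functions and a counting Bloom filter. It is a stand-in under stated assumptions and not the authors' approximate concurrent state machine: the class name `CountingBloomProfiler`, the counter-array size, the number of hashes, and the use of salted BLAKE2b as the hash family are all hypothetical choices made for illustration.

```python
import hashlib
from typing import Set, Tuple


class CountingBloomProfiler:
    """Hypothetical approximate counter of repeated operand tuples."""

    def __init__(self, num_counters: int = 1024, num_hashes: int = 3):
        self.counters = [0] * num_counters
        self.num_hashes = num_hashes

    def _indexes(self, operands: Tuple[int, ...]) -> Set[int]:
        # Derive several counter positions from one operand tuple by
        # salting BLAKE2b differently for each hash function.
        data = repr(operands).encode()
        indexes = set()
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=bytes([seed])
            ).digest()
            indexes.add(int.from_bytes(digest, "little") % len(self.counters))
        return indexes

    def record(self, operands: Tuple[int, ...]) -> None:
        """Note one more occurrence of this computation."""
        for i in self._indexes(operands):
            self.counters[i] += 1

    def estimate(self, operands: Tuple[int, ...]) -> int:
        """Estimated occurrence count; may overestimate, never underestimates."""
        return min(self.counters[i] for i in self._indexes(operands))
```

Because distinct operand tuples can share counters, the estimate can only err upward. In this sketch, that is the "minimal acceptable error" exchanged for a small, fixed-size profiling structure; operand tuples whose estimate exceeds a chosen threshold would be the candidates to pre-store in the associative memory.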
Thanks to the proposed method and its efficient runtime profiling, parallel processors can efficiently process a large and active dataset with the support of the adaptive associative memory. Our evaluation shows that the proposed ACAM improves the energy efficiency of GPGPU by 34.7% with an acceptable PSNR (peak signal-to-noise ratio) of more than 30 dB for image processing applications.

2. RELATED WORK
Non-volatile memories such as ReRAM and STT-RAM are good candidates for designing efficient associative memories with low leakage power [7] [10] [11]. Earlier efforts have used ReRAM and STT-RAM technologies to design stable and efficient TCAMs.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
ISLPED '16, August 08-10, 2016, San Francisco Airport, CA, USA
© 2016 ACM. ISBN 978-1-4503-4185-1/16/08…$15.00
DOI: http://dx.doi.org/10.1145/2934583.2934595