MapReduce Accelerator for Embedded Applications

Mihaela Maliţa, Computer Science Department, Saint Anselm College, Manchester, NH
Gheorghe M. Ştefan, Electronic Devices, Circuits and Architectures Department, Politehnica University of Bucharest, Romania

Abstract—There are large classes of embedded applications involving tightly interleaved complex and intense computations. The solution we propose segregates the complex from the intense in a many-core centered engine. We base our approach on a map-reduce abstract machine model suggested by Kleene's mathematical model of computation. An actual MapReduce Accelerator is described and evaluated for various application domains. Results, based on ASIC and FPGA implementations, show > 10× improvements in area and energy use.

Index Terms—many-core, accelerator-based architecture, embedded computation, MapReduce, computation model.

I. INTRODUCTION

Current embedded accelerators are either application-specific accelerators (for graphics, video, SDR, ...) or ad hoc structured multi- or many-core engines. Our proposal is to use a many-core approach to provide a general-purpose parallel engine able to perform efficiently all forms of intense computation requested in the embedded domain. The criteria for evaluating such an architecture are the use of power (GOPS/Watt) and the use of area (GOPS/mm²). According to the evaluations made for a few applications, our approach provides more than 10× improvements in both energy and area. The MapReduce Accelerator solution is compared with ARM processors, the most widely used general-purpose embedded processors.

The second section explains the reasons for using a MapReduce architecture for the intense part of the embedded computation. The third section describes the one-chip implementation of the MapReduce Accelerator's structure.
The fourth section reviews a few classes of applications, already investigated, and shows the improved use of area and power compared with mono-core embedded computation.

II. WHY MAP-REDUCE?

In order to develop a general-purpose parallel accelerator we propose a three-step approach: (1) consider a mathematical parallel model of computation (answering the question: what is parallel computation?), (2) define an abstract machine model (which is about how the structure of a parallel machine is organized), and (3) design a parallel architecture (which provides the functional interface between the physical structure and the informational structure used to program the parallel machine).

Instead of building ad hoc parallel engines by putting together Turing-based machines, we started from Stephen Kleene's mathematical model of computation [4], which provides a genuine model for parallel computation, just as Turing's model did for sequential computation. We have already proved that only the first of Kleene's three rules – composition – is independent [6]. Therefore, in defining real abstract machine models, the composition rule, expressed as

f(x) = g(h_1(x), h_2(x), ..., h_p(x)), where x = {x_1, x_2, ..., x_n},

is the only one to be considered.

Fig. 1. The circuit structure associated with composition: the map level computes h_1(x), ..., h_p(x) in parallel, and the reduce level computes g(h_1(x), ..., h_p(x)) = f(x).

The abstract parallel machine model results as the two-level construct suggested in Fig. 1. The map level works on correspondences between functions or vectors of functions and variables or vectors of variables. Three mappings result: function to vector of variables (data-parallel), variable to vector of functions (speculative-parallel), vector of functions to vector of variables (thread-parallel). The reduce level computes a vector-to-value function.
The user's image of the architecture consists of two arrays: the one-dimensional array of the external memory and the two-dimensional array of the internal memory (a vector memory) distributed along the cells of the map level (cell i stores all the i-th components of the vectors). The operations on the internal two-dimensional memory are predicated operations on vectors; a Boolean vector is used for predication. The accelerator performs the following types of operations:

scalarVect | BooleanVect <= OP(vect, vect)
scalarVect | BooleanVect <= OP(vect, scal)
scal | Boolean <= OP(vect)

The predicated operations perform a "spatial if-then-else" along the cells of the map level. The form:

where (BooleanVect) OP1(...); elseWhere OP2(...); endWhere

stands for: OP1 is done in the cells where BooleanVect = 1, while OP2 is done in the cells where BooleanVect = 0.

III. MAPREDUCE ONE-CHIP ENGINE

An actual implementation of the abstract machine defined above is shown in Fig. 2 [5], where PROCESSOR runs the complex part of the program and controls the execution of the intense