Not Multi-, but Many-Core: Designing Integral Parallel Architectures for Embedded Computation

Mihaela Malița, Gheorghe Ștefan, Dominique Thiébaut

Mihaela Malița: St. Anselm College, mmalita@anselm.edu
Gheorghe Ștefan: BrightScale Inc., gstefan@brightscale.com
Dominique Thiébaut: Smith College, thiebaut@cs.smith.edu

Abstract

Recent embedded systems have switched to fully programmable parallel architectures. New solutions are needed to ensure that this switch supports, and implements efficiently, all the corner cases usually present in real applications. We introduce the integral parallel architecture (IPA) as a solution that supports intensive data computation in System-on-a-Chip (SoC) implementations, fits in a small area, and requires low power. An IPA naturally supports all three possible styles of parallelism: data, time, and speculative. As an illustrative example, we present the BA1024 chip, a fully programmable SoC designed by BrightScale, Inc. for HDTV codecs. Its main performance figures include 60 GOPS/Watt and 2 GOPS/mm², representing an efficient IPA approach for embedded computation.

1 Introduction

Technology is currently going through some important evolutionary trends that have been identified by several researchers, including Borkar [2] and Asanovic [1]. One trend is the slowdown in the rate at which clock speeds increase. Another is the switch from pure standard functionality to more specific functionality in video, graphics, and performance-hungry applications requiring full programmability. A third is the replacement of Application Specific Integrated Circuits (ASICs) by programmable Systems on a Chip (SoCs), driven by the development costs and increasing technological difficulties associated with the former.

Borkar and Asanovic propose several new approaches for computer architectures to respond to these changes and curb their effects: application-domain-oriented architectures, in two versions (many- or multi-processors), or computation-type-oriented architectures. Intel presents a good example of the first type of architecture in its Recognition, Mining and Synthesis (RMS) white paper [2], while Asanovic provides an example of the second type in [1].

In this paper we propose two solutions to address some of the limitations imposed by the current technology shifts. The first is an optimized approach for low-power and small-area embedded computation in SoCs. The second is a way to remove some limitations categorized by Asanovic [1] as the 13th Dwarf, and qualified as an "automaton-style" computation.

The validity of the solutions we present here rests on two hypotheses. The first is that programmable SoCs can compete with ASICs only if a fully programmable parallel architecture is used, because a circuit is an intrinsically parallel system. The second is that the computational model of partial recursive functions [6] must be able to treat circuits and parallel programmable systems equally well.

Our approach naturally leads to two main results: the definition of an IPA¹ for intensive computations in embedded systems, and the proposal of a more nuanced taxonomy of parallel computation, as opposed to the more structural and functional approach first introduced by Flynn [5].

Both results are exemplified in the BA1024, which we believe is the first embodiment of an IPA. The BA1024 is initially targeted at the HDTV market, but because of its full programmability it can support other applications.

Parallel computation is becoming ever more ubiquitous, and manifests itself in two extreme forms, one in complex computation and the other in intense data-parallel computation.

¹ The reader is invited to view our approach as distinct from heterogeneous computing systems, that is, systems with a range of diverse computing resources that can be local to one another or geographically distributed.

Our paper deals with the second