Design space exploration of an open-source, IP-reusable, scalable floating-point engine for embedded applications

Claudio Brunelli a,*, Fabio Campi b, Claudio Mucci b, Davide Rossi b, Tapani Ahonen a, Juha Kylliäinen a, Fabio Garzia a, Jari Nurmi a

a Tampere University of Technology, Department of Computer Systems, P.O. Box 553, FIN-33101 Tampere, Finland
b ARCES Laboratories, University of Bologna, Viale Pepoli 3/2, 40136, Italy

Article info

Article history:
Received 30 August 2007
Received in revised form 22 May 2008
Accepted 23 May 2008
Available online 3 June 2008

Keywords: Floating-point; FPU; Coprocessor; VHDL; Embedded

Abstract

This paper describes an open-source and highly scalable floating-point unit (FPU) for embedded systems. Our FPU is fast and efficient thanks to the high parallelism of its architecture: the functional units inside the datapath can operate in parallel and independently of each other. Different versions of the FPU are compared to highlight how performance scales. Logic synthesis results show that our FPU requires 105 Kgates and runs at 400 MHz on a low-power 90 nm standard-cell technology, and occupies 20 K logic elements and runs at 67 MHz on an Altera Stratix FPGA. The proposed FPU is supported by a software tool suite which compiles programs written in C/C++. A set of DSP and 3D graphics algorithms has been benchmarked, showing that with our FPU the number of clock cycles required to perform each algorithm is one order of magnitude smaller than that required by the corresponding software implementation.

© 2008 Published by Elsevier B.V.

1. Introduction

The need to perform increasingly complex computations in the domain of embedded systems is becoming acute, posing challenging design problems.
Considering the pressing needs of modern applications, we observed that a floating-point unit is in many cases a necessary resource to enhance system performance, since regular RISC cores usually cannot keep pace with the requirements of those applications. Our goal is to provide a physical programmable platform which supports general-purpose processing and is also powerful enough to run heavy applications which need floating-point calculations, such as Dolby Digital audio encoding [1], DSP algorithms like the FFT, and 3D graphics applications [2]. Among the most popular 3D graphics algorithms included in the graphics pipeline we can mention Gouraud shading for lighting, the scan-line algorithm for the rendering stage, and the Z-buffer: they all need floating-point calculations. These applications make extensive use of floating-point arithmetic, so they usually call for the implementation of a hardware floating-point unit (FPU) inside the target system. We created a 3D application which uses those algorithms intensively in order to measure the benefits of using an FPU. Such an FPU should be portable across different technologies, in particular FPGA devices, so that the user can easily obtain a prototype suitable for debugging and benchmarking purposes. Developing such a system requires exploring the nature of the applications it is meant to support, and facing the problems related to the design and implementation of a large and complex system like a SoC.

We designed a floating-point unit (FPU) named Milk. Milk handles all the floating-point instructions, while the main microprocessor executes only general-purpose and control-flow code. Milk is an IP component which can be plugged as is into a host system.
Journal of Systems Architecture 54 (2008) 1143–1154
1383-7621/$ - see front matter. doi:10.1016/j.sysarc.2008.05.005
* Corresponding author.
E-mail addresses: claudio.brunelli@tut.fi (C. Brunelli), fcampi@deis.unibo.it (F. Campi), cmucci@deis.unibo.it (C. Mucci), drossi@deis.unibo.it (D. Rossi), tapani.ahonen@tut.fi (T. Ahonen), juha.p.kylliainen@tut.fi (J. Kylliäinen), fabio.garzia@tut.fi (F. Garzia), jari.nurmi@tut.fi (J. Nurmi).

The approach of using IP components to build a complex System-on-Chip (SoC) is very convenient: each IP component can be designed either as a fully custom ASIC circuit or as a block described at a high abstraction level [3] using hardware description languages like VHDL. The VHDL code can be handled by dedicated logic synthesis tools, which select the best physical resources to efficiently implement the described hardware [4]. We created an IP component which is extremely flexible and easily customizable by the user, minimizing the time needed to adjust it.

The proposed FPU is designed so that it can be easily interfaced with any RISC core, guaranteeing a broad range of portability. As a first proof we successfully interconnected it with two different RISC processors: the Coffee RISC core [5], developed at Tampere