Automating Custom-Precision Function Evaluation for Embedded Processors

Ray C.C. Cheung, Department of Computing, Imperial College London, London, United Kingdom, r.cheung@imperial.ac.uk
Dong-U Lee, EE Department, University of California, Los Angeles, USA, dongu@icsl.ucla.edu
Oskar Mencer, Department of Computing, Imperial College London, London, United Kingdom, o.mencer@imperial.ac.uk

ABSTRACT
Due to resource and power constraints, embedded processors often cannot afford dedicated floating-point units. For instance, the IBM PowerPC processor embedded in Xilinx Virtex-II Pro FPGAs only supports emulated floating-point arithmetic, which leads to slow operation when floating-point arithmetic is desired. This paper presents a customizable mathematical library using fixed-point arithmetic for elementary function evaluation. We approximate functions via polynomial or rational approximations, depending on the user-defined accuracy requirements. The data representations for the inputs and outputs are compatible with IEEE single-precision and double-precision floating-point formats. Results show that our 32-bit polynomial method achieves over 80 times speedup over the single-precision mathematical library from Xilinx, while our 64-bit polynomial method achieves over 30 times speedup.

Categories and Subject Descriptors
C.3 [Special-purpose and Application-based Systems]: Real-time and embedded systems; D.3.4 [Programming Languages]: Processors—code generation, optimization.

General Terms
Measurement, Performance, Design.

Keywords
Embedded systems, reconfigurable computing, function evaluation, fixed-point arithmetic.

1. INTRODUCTION
The evaluation of elementary functions is often the performance bottleneck of many compute-bound applications [15]. Examples of these functions include the logarithm log(x) and the square root √x.
Evaluating such functions efficiently while meeting the precision requirements is particularly important for embedded applications, where stringent resource and power constraints are enforced.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CASES'05, September 24–27, 2005, San Francisco, California, USA.
Copyright 2005 ACM 1-59593-149-X/05/0009 ...$5.00.

[Figure 1: Overview of current and the proposed embedded processors. Current approaches include (a) using a math coprocessor and (b) using floating-point emulation; (c) is our approach, in which instructions are generated using Matlab for the embedded integer processor.]

Advanced FPGAs enable the development of configurable SoC systems and high-speed function evaluation units that are customized to particular applications. As shown in Figure 1(a), in embedded systems the integer processor is usually incorporated with one or more dedicated coprocessors, such as a math coprocessor for fast function evaluation, which results in a tradeoff between area, cost, and performance. Figure 1(b) illustrates the emulated floating-point mathematical library from Xilinx [5]. In this approach, floating-point arithmetic is emulated using integer operations only, without the use of a coprocessor.¹ Performance degradation and code space consumption are the two major problems with this approach.

In this paper, we propose an Integer Mathematical Generation tool, IMGen, which makes use of optimized fixed-point (integer) arithmetic for internal computations.
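As a rough illustration of what fixed-point (integer) internal arithmetic looks like on a 32-bit embedded processor, the C sketch below implements a multiply in a hypothetical Q2.30 format (2 integer bits, 30 fraction bits). The format choice and the names `fix_mul`, `to_fix`, and `to_dbl` are ours for illustration only and are not taken from IMGen:

```c
#include <stdint.h>

/* Hypothetical Q2.30 fixed-point format: 2 integer bits, 30 fraction bits. */
#define FRAC_BITS 30

typedef int32_t q2_30;

/* Conversions for illustration; a generator would emit fixed-point
 * constants at design time rather than converting at run time. */
static q2_30  to_fix(double x) { return (q2_30)(x * (1LL << FRAC_BITS)); }
static double to_dbl(q2_30 x)  { return (double)x / (1LL << FRAC_BITS); }

/* Fixed-point multiply: widen to 64 bits, then shift back into Q2.30. */
static q2_30 fix_mul(q2_30 a, q2_30 b)
{
    return (q2_30)(((int64_t)a * b) >> FRAC_BITS);
}
```

The key point is that the product is widened to 64 bits before the shift, so the intermediate result loses no precision; the whole operation maps to a handful of integer instructions on the embedded processor.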
IEEE single- and double-precision floating-point formats are used for both the input and output formats, so that the internal computation is transparent to the user. A design generator automatically selects the best polynomial/rational approximation for the internal computations, and the degree of the approximation, for a given error tolerance.

¹ Recently, Xilinx has released the Virtex-4 FX FPGA, which has an Auxiliary Processor Unit (APU) [1] that can connect a math coprocessor using the FPGA fabric. In this work, we compare the designs without using this math coprocessor.
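To sketch how a generated polynomial approximation can be evaluated with integer operations only, the following C fragment applies Horner's rule in a hypothetical Q2.30 fixed-point format (2 integer bits, 30 fraction bits). This is our own illustration; the actual code and coefficient formats emitted by IMGen differ:

```c
#include <stdint.h>

#define FRAC_BITS 30
typedef int32_t q2_30; /* hypothetical Q2.30 signed fixed-point type */

/* Fixed-point multiply: widen to 64 bits, then shift back into Q2.30. */
static q2_30 fix_mul(q2_30 a, q2_30 b)
{
    return (q2_30)(((int64_t)a * b) >> FRAC_BITS);
}

/* Degree-n polynomial evaluation by Horner's rule:
 * p(x) = c[0] + x*(c[1] + x*(... + x*c[n]))
 * Valid only while all intermediate values stay in Q2.30 range. */
static q2_30 poly_eval(const q2_30 *c, int n, q2_30 x)
{
    q2_30 acc = c[n];
    for (int i = n - 1; i >= 0; --i)
        acc = fix_mul(acc, x) + c[i];
    return acc;
}
```

For example, with coefficients representing p(x) = 1 + x + 0.5x², the routine computes p(0.5) = 1.625 exactly in Q2.30, since every intermediate value here happens to be representable.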