Proceedings of the Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, California, November 7-10, 2003.

Exploration and Evaluation of PLX Floating-point Instructions and Implementations for 3D Graphics

Xiao Yang (1), Shamik K. Valia (2), Michael J. Schulte (2), and Ruby B. Lee (1)
(1) Department of Electrical Engineering, Princeton University, Princeton, NJ 08544
(2) Department of Electrical and Computer Engineering, University of Wisconsin-Madison, Madison, WI 53706

Abstract: PLX FP is a floating-point instruction set architecture (ISA) extension to PLX that is designed for fast and efficient 3D graphics processing. In this paper, we explore the implementation and performance of the fundamental functional unit for PLX FP, the floating-point multiply-accumulate (FMAC) functional unit. We present simulation and synthesis results for several implementations with increasingly powerful sets of instructions, to compare area and delay tradeoffs. We also evaluate the performance tradeoffs with examples taken from the 3D graphics processing pipeline.

I. INTRODUCTION

With the proliferation of computer games, floating-point (FP) intensive 3D graphics processing is rapidly becoming a major component of the workload on multiple computing platforms. To address the needs of 3D graphics, we proposed PLX FP [1][2], a fully subword-parallel floating-point ISA extension to the PLX architecture [3]. It enables fast and efficient 3D graphics processing on PLX, an architecture designed from scratch for fast multimedia processing.

Six classes of floating-point instructions are defined in PLX FP: arithmetic, compare, mathematical approximation, data rearrangement, data conversion, and memory instructions [1][2], as shown in Table 1. To execute the PLX FP instructions, a new FP datapath (Figure 1) needs to be added to the base PLX architecture. It includes a separate FP register file, a new set of functional units, and the corresponding data buses. Each of the new functional units handles certain classes of PLX FP instructions. The FMAC unit executes the arithmetic and compare instructions.

Figure 1: Datapath for PLX FP

A novel feature of PLX FP is the introduction of two new types of arithmetic instructions: the FP scale and dot product instructions. These instructions can effectively speed up vector scaling and dot product operations, which are common in 3D graphics. In this paper, we explore the implementations of the FMAC unit to study area and delay tradeoffs, evaluate the incremental cost of the new scale and dot product instructions, and study the performance impact of different FMAC implementations on 3D graphics processing.

Like integer PLX, PLX FP is datapath scalable. A register can hold one, two, or four subwords, each containing a 32-bit IEEE single-precision FP value. Correspondingly, the datapath size can be 32-bit, 64-bit, or 128-bit, suitable for meeting different cost and performance targets. We assume a 128-bit FP datapath in this paper, because the 4-element vector is the most commonly used data type in 3D graphics.

The paper is organized as follows. Section II surveys past work related to PLX FP and FMAC implementations. Section III studies the implementations of the FMAC unit and presents simulation and synthesis results. Section IV evaluates performance and analyzes the results. Finally, we conclude in Section V.

TABLE 1: PLX FP INSTRUCTIONS

Arithmetic (Vector):  padd psub pmul pfmuladd pfmulsub pfabs pfmax pfmin pfscale,j pfscaleadd,j pfscalesub,j pfdp pfdp.s padd.neg pmul.neg pfmuladd.neg pfmulsub.neg pfabs.neg pfscale.neg,j pfscaladd.neg,j pfscalesub.neg,j pfdp.neg pfdp.s.neg
Arithmetic (Scalar):  fadd fsub fmul fmuladd fmulsub fabs fmax fmin fadd.neg fmul.neg fmuladd.neg fmulsub.neg fabs.neg
Compare:              pfcmp.rel fcmp.rel fcmp.rel.pw1
Math approximation:   frcpa frcpsqrta flog2a fexp2a
Rearrangement:        fmix.l fmix.r fpermute fextract fdeposit
Conversion:           pfcvti pfcvtu picvtf puicvtf fcvti fcvtui icvtf uicvtf
Memory:               fload floadx fload.u floadx.u fstore fstorex fstore.u fstorex.u
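
To make the vector scaling and dot product operations mentioned in the Introduction concrete, the following Python sketch models them on a 4-element vector, along with the relationship between datapath width and subword count. It is only an illustration under stated assumptions: the function names are hypothetical, the operand conventions (which operand supplies the scaling subword j, and how the dot product result is returned) are assumed rather than taken from the PLX FP definition, and Python floats are double precision, so single-precision rounding in the FMAC unit is not modeled.

```python
from typing import List, Sequence

SUBWORD_BITS = 32  # each subword holds one IEEE single-precision value (from the text)


def subwords_per_register(datapath_bits: int) -> int:
    """Number of FP subwords in a register for a 32-, 64-, or 128-bit datapath."""
    if datapath_bits not in (32, 64, 128):
        raise ValueError("datapath must be 32, 64, or 128 bits wide")
    return datapath_bits // SUBWORD_BITS


def vector_scale(a: Sequence[float], b: Sequence[float], j: int) -> List[float]:
    """ASSUMED semantics: multiply every subword of a by subword j of b."""
    return [x * b[j] for x in a]


def vector_dot(a: Sequence[float], b: Sequence[float]) -> float:
    """ASSUMED semantics: sum of the subword-wise products of a and b."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0]
    b = [0.5, 2.0, -1.0, 4.0]
    print(subwords_per_register(128))  # 4 subwords in the 128-bit datapath
    print(vector_scale(a, b, 1))       # every element of a multiplied by b[1]
    print(vector_dot(a, b))            # 1*0.5 + 2*2 + 3*(-1) + 4*4
```

Without dedicated instructions, the same results would need a subword-wise multiply combined with extra data rearrangement (to broadcast one subword for scaling, or to sum across subwords for the dot product), which is the kind of overhead the scale and dot product instructions are meant to avoid.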