A Decimal Fully Parallel and Pipelined Floating Point Multiplier

Ramy Raafat 1, Amira M. Abdel-Majeed 1, Rodina Samy 1, Tarek ElDeeb 1, Yasmin Farouk 1, Mostafa Elkhouly 1, Hossam A. H. Fahmy 2

1 SilMinds, LLC. Smart Village, B115, 12577, Giza, Egypt.
2 Electronics and Communication Department, Cairo University, Giza, Egypt.
ramy.raafat@silminds.com

Abstract- Decimal arithmetic is important in several commercial applications including financial analysis, banking, tax calculation, currency conversion, insurance, and accounting. This paper presents a fully parallel Decimal64 floating point (FP) multiplier compliant with IEEE Std 754-2008 for floating point arithmetic. The proposed multiplier uses novel methods to target low latency. The design is based on a previously published fixed point multiplier [1] that uses a novel BCD-4221 recoding for decimal digits to improve the area and latency of the partial product generation and the partial product reduction tree. Several enhancements are introduced to the design: the final carry propagation adder is implemented as a fully parallel decimal adder with a Kogge-Stone prefix tree, and the sticky bit is generated in parallel with the shifter to reduce the critical path delay. The design is extendable to support Decimal128 floating point multiplication. The multiplier is hardware verified for functionality on an FPGA.

I. INTRODUCTION

Decimal arithmetic has received increased attention in the last decade because of its growing need in many commercial applications and database systems where binary arithmetic is not sufficient. The arithmetic operations in these applications need to be executed in decimal format. This is because some decimal numbers, such as 0.1, cannot be represented exactly in binary format with limited precision. This leads to inaccuracy when floating point decimal arithmetic is emulated by floating point binary arithmetic units.
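The representation problem described above can be observed directly in software. The following is a minimal sketch, assuming Python's standard `decimal` module as a stand-in for decimal arithmetic; it is not the hardware design presented in this paper, and the variable names are illustrative only.

```python
from decimal import Decimal

# Binary double precision: 0.1 has no finite base-2 expansion,
# so the stored operands are only close to the decimal values.
binary_sum = 0.1 + 0.2
binary_exact = (binary_sum == 0.3)

# Decimal arithmetic: the same operands are represented exactly.
decimal_sum = Decimal("0.1") + Decimal("0.2")
decimal_exact = (decimal_sum == Decimal("0.3"))

# Converting the binary double to Decimal exposes the value
# that the binary format actually stores for 0.1.
stored = Decimal(0.1)

print("binary  0.1 + 0.2 == 0.3:", binary_exact)
print("decimal 0.1 + 0.2 == 0.3:", decimal_exact)
print("value stored for binary 0.1:", stored)
```

The binary comparison fails while the decimal comparison succeeds, which is the kind of discrepancy that financial and accounting applications cannot tolerate.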
Decimal arithmetic software libraries have been developed to overcome the decimal to binary conversion error, but they are about 100 to 1000 times slower than what can be implemented in hardware [2]. In the near future, decimal floating-point (DFP) units are expected to be embedded in many processor cores to perform decimal operations faster than the software packages and with higher accuracy. Due to the importance of and the growing need for decimal arithmetic, its specifications are included in the new IEEE standard for floating point arithmetic (IEEE Std 754-2008) [6].

This paper introduces a decimal floating point multiplier based on the radix-10 fixed point multiplier of [1], which generates the partial products in parallel and then reduces them with a novel carry save addition (CSA) tree that leaves the result in carry save (CS) format. This carry save addition tree uses a BCD-4221 recoding for decimal digits to improve the area and latency. In our proposed design, a novel decimal carry propagation adder adds the outputs of the carry save addition tree to obtain the intermediate product. Since our design is a floating point multiplier, additional information must be calculated to round the result correctly. The shift amount and sticky counter calculations are executed early to reduce the design latency. The novelty of the proposed design is that it has low latency and low area compared to previous decimal multiplier designs.

The paper is organized as follows: Section (II) contains background information about decimal multiplication and an overview of IEEE Std 754-2008. Section (III) explains the proposed multiplier design in detail and highlights its novelty. Section (IV) contains testing and synthesis results, emphasizing the pipelining results, followed by conclusions in Section (V).

II.
BACKGROUND

Decimal multiplication performs the computation

$$P = A \times B \qquad (1)$$

where A is the multiplicand, B is the multiplier, and P is the product. It is assumed that A and B are each n digits long; hence P is at most 2n digits long and must be rounded to fit in a limited precision of n digits. Several approaches to decimal multiplication have been proposed. The simplest and most straightforward one is to iterate over the digits of the multiplier B and, based on the value of the current digit, add the corresponding multiple of the multiplicand A to an intermediate product. In this approach the multiples 2A through 9A of the multiplicand must be generated, which consumes large area and delay. Equation 2 represents this approach to decimal multiplication.

$$P_{i+1} = P_i + A \cdot B_i \cdot 10^{i} \qquad (2)$$

where $P_i$ is the partial product, $P_0 = 0$, and $0 \le i \le n-1$.

Another approach [5] is to generate secondary multiples, a reduced set of multiples, and to generate any missing multiple by adding two multiples from this secondary set based on the value of the current digit of the multiplier B. This approach reduces the complexity of generating eight mul-