An Improved Unified Scalable Radix-2 Montgomery Multiplier

David Harris
Harvey Mudd College
David_Harris@hmc.edu

Ram Krishnamurthy, Mark Anders, Sanu Mathew, and Steven Hsu
Intel Circuits Research Laboratory
Ram.Krishnamurthy@intel.com

Abstract

This paper describes an improved version of the Tenca-Koç unified scalable radix-2 Montgomery multiplier with half the latency for small and moderate precision operands and half the queue memory requirement. Like the Tenca-Koç multiplier, this design is reconfigurable to accept any input precision in either GF(p) or GF(2^n) up to the size of the on-chip memory. An FPGA implementation can perform 1024-bit modular exponentiation in 16 ms using 5598 4-input lookup tables, making it the fastest unified scalable design yet reported.

1. Introduction

Multiplication in a finite field is essential to many encryption algorithms including RSA, Diffie-Hellman key exchange, the Digital Signature Algorithm, and elliptic curve cryptography [1]. The two common finite Galois fields are GF(2^n), used for elliptic curves, and GF(p), used for most other algorithms. Multiplication in a prime field GF(p) is performed modulo some prime p. Multiplication in a binary extension field GF(2^n) is performed modulo some irreducible polynomial f(x) of degree n. It is implemented identically to GF(p) except that carries are not propagated. Therefore addition reduces to the XOR operation.

Cryptographic computations are time-consuming because they operate on precisions of 256 to 2048 or more bits and require large numbers of multiplications to perform exponentiation. The Montgomery multiplication algorithm [2] is commonly used because it avoids division by the modulus.

Many software and hardware implementations of Montgomery multiplication have been proposed. Software uses repeated multiplication and addition instructions [3, 4]. Radix-2 hardware designs operate in a word-serial fashion with addition as the basic operation [5].
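The earlier remark that GF(2^n) multiplication is implemented identically to GF(p) multiplication, except that carries are not propagated, can be illustrated with a small software sketch. The Python below is only an illustration under stated assumptions, not the hardware design described in this paper: the function names `mul_gfp` and `mul_gf2n` are hypothetical, field elements are held in plain integers, and the reduction is done bit by bit rather than by the Montgomery method.

```python
def mul_gfp(x, y, p):
    """Bit-serial shift-and-add multiply in GF(p), using ordinary addition."""
    acc = 0
    while y:
        if y & 1:
            acc = (acc + x) % p      # carries propagate through the sum
        x = (x << 1) % p             # double the multiplicand, reduce mod p
        y >>= 1
    return acc


def mul_gf2n(x, y, f, n):
    """Same loop in GF(2^n): every addition becomes a carry-free XOR.

    f is the irreducible polynomial of degree n, packed into an integer
    with bit i holding the coefficient of x^i (an assumed encoding).
    """
    acc = 0
    while y:
        if y & 1:
            acc ^= x                 # polynomial addition, no carries
        x <<= 1                      # multiply the multiplicand by x
        if (x >> n) & 1:             # degree reached n: subtract (XOR) f
            x ^= f
        y >>= 1
    return acc


if __name__ == "__main__":
    print(mul_gfp(7, 9, 11))                 # 7 * 9 mod 11
    print(mul_gf2n(0b010, 0b100, 0b1011, 3)) # x * x^2 mod (x^3 + x + 1)
```

The two loops have the same structure; only the add and reduce steps differ, which is the property a unified datapath relies on.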
Higher-radix designs use fewer cycles at the expense of requiring multiplications or memories containing precomputed multiples [6, 7, 8]. Hardware designs are said to be scalable if they can work on variable precision limited only by memory capacity. They are unified if they handle both GF(p) and GF(2^n) on the same array [9].

This paper proposes an improvement on the Tenca-Koç scalable unified radix-2 Montgomery multiplier [5] with half the latency for small and moderate-precision operands. The paper begins by reviewing Montgomery multiplication and the Tenca-Koç algorithm. It then describes how to left-shift input operands rather than right-shift results, which avoids a bottleneck waiting for the most significant bit of each result word. The queue size is also cut in half by converting results to nonredundant format before storing them. Delay, area, and power results for a Verilog implementation synthesized to a Xilinx FPGA are discussed.

2. Montgomery Multiplication

We would like to compute Z = X × Y mod M, where the operands have n bits of precision and M is an odd number in the range $2^{n-1} < M < 2^n$. In GF(p), M is the prime p. In GF(2^n), M is a binary representation of an irreducible polynomial and carries are not propagated between columns in the multiplication. The modulo operation is expensive because it involves division.

Montgomery [2] observed that the divisions can be converted into simple shifts if multiplication is instead performed on so-called Montgomery residues (M-residues). The M-residue of an integer a ($0 \le a < M$) is defined to be $\bar{a} = a r \bmod M$, where $r = 2^n$. For example, if r = 16 and M = 11, then $\bar{3} = 3 \times 16 \bmod 11 = 4$. There is an isomorphism between integers in this range and their Montgomery residues. The modular multiplicative inverse $b^{-1}$ of an integer b is that number such that $b b^{-1} \bmod M = 1$. For