High-Performance Integer Factoring with Reconfigurable Devices

Ralf Zimmermann, Tim Güneysu and Christof Paar
Horst Görtz Institute for IT-Security, Ruhr-University Bochum, Germany
Email: {zimmermann,gueneysu,cpaar}@crypto.rub.de

Abstract—We present a novel FPGA-based implementation of the Elliptic Curve Method (ECM) for the factorization of medium-sized composite integers. More precisely, we demonstrate an ECM implementation capable of determining prime factors of up to 2,424 integers of 151 bits each per second using a single Xilinx Virtex-4 SX35 FPGA. Using this implementation on a cluster like the COPACOBANA is beneficial for attacking cryptographic primitives like the well-known RSA cryptosystem with advanced methods such as the Number Field Sieve (NFS). To provide this vast number of integer factorizations per FPGA, we make use of the available DSP blocks on each Virtex-4 device to accelerate low-level arithmetic computations. This methodology allows the development of a time-area efficient design that runs 24 ECM cores in parallel, implementing both phase 1 and phase 2 of the ECM. Moreover, our design is fully scalable and supports composite integers in the range from 66 to 236 bits without any significant modifications to the hardware. Compared to the implementation by Gaj et al., who reported an ECM design for the same Virtex-4 platform, our improved architecture provides a cost-performance ratio that is better by a factor of 37.

Index Terms—Factorization, elliptic curve method, reconfigurable hardware, COPACOBANA.

I. INTRODUCTION

In 1987, the Elliptic Curve Method (ECM) was introduced by H. W. Lenstra [1] as a new method for integer factorization, generalizing the concept of Pollard's p − 1 and Williams' p + 1 methods [2], [3]. Although the ECM is known not to be the fastest factorization method with respect to asymptotic time complexity, it is widely used to factor composite numbers of up to 200 bits because of its very low memory requirements.
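To make the idea behind the ECM concrete, the following is a minimal software sketch of its first phase, written in the textbook style with affine Weierstrass coordinates. It is an illustration only and says nothing about the curve form, coordinates, or parameters used in the hardware design; the function names (`ecm_phase1`, `ec_add`, `ec_mul`), the bound `b1`, and the curve family y^2 = x^3 + a*x - a with starting point (1, 1) are all assumptions chosen for brevity. A factor of n appears whenever a modular inversion fails, because the failing denominator then shares a divisor with n.

```python
from math import gcd


class FactorFound(Exception):
    """Raised when a denominator is not invertible modulo n."""

    def __init__(self, factor):
        super().__init__(factor)
        self.factor = factor


def inv_mod(x, n):
    """Invert x modulo n, or raise FactorFound with gcd(x, n) if impossible."""
    x %= n
    g = gcd(x, n)
    if g != 1:
        raise FactorFound(g)  # g may equal n itself; the caller checks this
    return pow(x, -1, n)  # modular inverse, Python 3.8+


def ec_add(P, Q, a, n):
    """Add two points on y^2 = x^3 + a*x + b over Z/nZ (None = infinity)."""
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % n == 0:
        return None
    if P == Q:
        lam = (3 * x1 * x1 + a) * inv_mod(2 * y1, n) % n
    else:
        lam = (y2 - y1) * inv_mod(x2 - x1, n) % n
    x3 = (lam * lam - x1 - x2) % n
    y3 = (lam * (x1 - x3) - y1) % n
    return (x3, y3)


def ec_mul(k, P, a, n):
    """Compute k*P with the binary double-and-add method."""
    R = None
    while k:
        if k & 1:
            R = ec_add(R, P, a, n)
        P = ec_add(P, P, a, n)
        k >>= 1
    return R


def small_primes(bound):
    """All primes <= bound by trial division (adequate for small bounds)."""
    return [p for p in range(2, bound + 1)
            if all(p % q for q in range(2, int(p ** 0.5) + 1))]


def ecm_phase1(n, b1=50, max_curves=200):
    """Try several curves; return a nontrivial factor of n, or None."""
    primes = small_primes(b1)
    for a in range(1, max_curves + 1):
        P = (1, 1)  # lies on y^2 = x^3 + a*x - a for every a
        try:
            for p in primes:
                pe = p
                while pe * p <= b1:  # largest power of p not exceeding b1
                    pe *= p
                P = ec_mul(pe, P, a, n)
                if P is None:  # infinity modulo all factors at once
                    break
        except FactorFound as hit:
            if hit.factor != n:
                return hit.factor
    return None
```

For example, for n = 455839 = 599 · 761, any value that `ecm_phase1(455839)` returns is one of these two primes, since they are the only nontrivial divisors. The method succeeds on a curve when the order of the point modulo one prime factor is composed of prime powers not exceeding `b1`, which is why trying many curves pays off.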
The most prominent application that relies on the hardness of the factorization problem is the RSA cryptosystem. An attacker on RSA has to find the factorization of a composite number n that is the product of two large primes p and q. In practice, the RSA security parameter n is larger than 1024 bits and hence out of reach of the ECM. To date, numbers of this size are best attacked with the most powerful methods known, such as the Number Field Sieve (NFS). However, the complex NFS¹ involves a search for relations in which many mid-sized numbers must be tested for smoothness, i.e., whether they are composed only of small prime factors not larger than a fixed bound B. In this context, ECM is an important tool to determine the smoothness of such integers (i.e., whether they can be factored into small primes), in particular due to its moderate resource requirements.

¹ The NFS comprises four steps: polynomial selection, relation finding, a linear algebra step, and finally the square root step. The relation finding step is the most time-consuming, taking roughly 90% of the runtime. For more information on the NFS, refer to [4].

The fastest ECM implementations for retrieving factors of composite integers are software-based; a state-of-the-art system is the GMP-ECM software published by P. Zimmermann et al. [5], which has been extended for use with GPUs by Bernstein et al. [6]. As a promising alternative, efficient hardware implementations of the ECM were first proposed in 2005: Šimka et al. [7] demonstrated the feasibility of implementing the ECM in reconfigurable hardware by presenting a first proof-of-concept implementation. Their results were improved by Gaj et al. [8], [9], who also showed a complete hardware implementation of ECM phase 2. However, the low-level arithmetic in these implementations was realized only with straightforward techniques within the configurable logic, which leaves room for further improvements. To fill this gap, de Meulenaer et al.
[10] proposed an unrolled Montgomery multiplier based on a two-dimensional pipeline on Xilinx Virtex-4 FPGAs to accelerate the field arithmetic. However, due to limitations in area and the long pipeline, their design efficiently supports only the first phase of the ECM.

Contribution: In this work we propose a novel ECM architecture for Xilinx Virtex FPGAs that makes use of DSP blocks for the computationally intensive arithmetic. Our focus is to accelerate the underlying field arithmetic of the ECM on FPGAs without sacrificing the option to combine both phase 1 and phase 2 in a single core. To this end, we adopt some high-level design decisions, such as the memory management and the use of SIMD instructions, from [8], which also supports both phases on the same hardware. To improve the field arithmetic, we place fundamental arithmetic functions like adders and multipliers in the embedded DSP blocks of modern FPGAs. For factoring large quantities of numbers, we finally describe our factorization setup based on a variant of COPACOBANA (Cost Optimized PArallel COde Breaker), a cluster system based on FPGAs [11], [12].

Outline: We start with a short review of the mathematical background and the concept of the ECM. In Section III we first describe the cluster system COPACOBANA, which represents the target platform of our work, and then discuss the architecture of an ECM core and its corresponding arithmetic components. Finally, we present our factorization results in Section IV before we conclude with Section V.

2010 International Conference on Field Programmable Logic and Applications 978-0-7695-4179-2/10 $26.00 © 2010 IEEE DOI 10.1109/FPL.2010.26 83
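For readers unfamiliar with the Montgomery multiplication named in the discussion of [10], the following is a minimal software sketch of the underlying reduction step. It operates on whole Python integers and does not model the pipelined, word-serial, or DSP-based hardware structures of any of the cited designs; the function names (`montgomery_setup`, `redc`, `mont_mul`) and the 151-bit operand size in the usage example are assumptions made for illustration. The point of the technique is that reduction modulo n is replaced by masking and shifting with respect to a power of two R.

```python
def montgomery_setup(n, bits):
    """Precompute constants for an odd modulus n < 2**bits.

    Returns R = 2**bits and n_prime with n * n_prime = -1 (mod R).
    """
    r = 1 << bits
    n_prime = (-pow(n, -1, r)) % r  # modular inverse, Python 3.8+
    return r, n_prime


def redc(t, n, bits, n_prime):
    """Montgomery reduction: return t * R^(-1) mod n for 0 <= t < n * R."""
    mask = (1 << bits) - 1
    m = ((t & mask) * n_prime) & mask  # m = t * n_prime mod R
    u = (t + m * n) >> bits            # t + m*n is divisible by R
    return u - n if u >= n else u      # single conditional subtraction


def mont_mul(a_bar, b_bar, n, bits, n_prime):
    """Multiply two values given in Montgomery form (x_bar = x * R mod n)."""
    return redc(a_bar * b_bar, n, bits, n_prime)


# Usage: multiply a and b modulo an (illustrative) odd 151-bit modulus.
bits = 151
n = (1 << bits) - 195
r, n_prime = montgomery_setup(n, bits)
a = 0x1234567890ABCDEF1234567890ABCDEF % n
b = pow(3, 90) % n
a_bar = a * r % n                      # convert into Montgomery form
b_bar = b * r % n
product = redc(mont_mul(a_bar, b_bar, n, bits, n_prime), n, bits, n_prime)
```

After the final `redc` call, which converts the result back out of Montgomery form, `product` equals `a * b % n`. The conversion cost is amortized in practice because long chains of multiplications, as in elliptic curve point arithmetic, can stay in Montgomery form throughout.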