Speeding Up AES By Extending a 32 bit Processor Instruction Set

Guido Marco Bertoni, ST Microelectronics, Agrate Brianza, Italy, bertoni@st.com
Luca Breveglieri, Politecnico di Milano, Milano, Italy, breveglieri@elet.polimi.it
Farina Roberto, CEFRIEL - Politecnico di Milano, Milano, Italy, roberto.farina@cefriel.it
Francesco Regazzoni, ALaRI, University of Lugano, Lugano, Switzerland, regazzoni@alari.ch

Abstract

The need for speed in encryption and decryption operations is greater today than in the past, owing to the spread of real-time applications that rely on cryptography. Many cryptographic coprocessors have been studied and presented, but only a few works have addressed the enhancement of the instruction set architecture (ISA) of the embedded processor. This paper presents an extension of the ISA of a 32-bit processor that aims at speeding up software implementations of the AES algorithm. After identifying the most frequently executed and the most time-consuming sections of the algorithm, we design a set of dedicated instructions to improve the performance of the cipher operations. We validate our instruction set extension by measuring the speed-up for different optimized implementations of AES using an ARM processor simulator, but the enhancements we propose are general enough to be applied to almost all 32-bit processors.

1 Introduction

The Advanced Encryption Standard (AES) [8] is becoming the default choice for secure data communication in current and future embedded systems, where the performance of AES is of crucial importance in meeting efficiency constraints. Although several optimized coprocessor implementations have been proposed ([6] and [10] are examples), only recently has work begun to address the enhancement of the instruction set architecture (ISA) of the embedded processor.
In this paper, we introduce specific instructions designed to speed up the AES algorithm that can be applied to almost all 32-bit processors. These instructions are cost-effective because they share resources, such as registers and the data path, with the existing general-purpose processor. Two possible sets of instructions are analyzed: byte-oriented and word-oriented. SMix and Sbox belong to the first group and are intended to speed up the round operations and the substitution table of the key unrolling. The second group comprises SMixW, SubWord and KSFW: the first two are for the round operations, while the last is specifically designed for the key unrolling. Using an ARM simulator, we estimate a speed-up between 1.43 and 3.45, depending on the instruction set used.

The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 summarizes the Rijndael algorithm. Section 4 presents our proposal for speeding up AES on a 32-bit processor. Section 5 shows the new assembly code and presents the performance results we obtained. Section 6 concludes the paper.

2 Related work

The idea of extending a general-purpose instruction set architecture for the performance-critical operations of AES has been addressed in some previous work.

[1] explores general techniques to improve the performance of eight popular symmetric-key cipher algorithms. New instructions are introduced with the goal of supporting fast substitutions, general permutations, rotations, and modular arithmetic. Performance analysis of the optimized ciphers shows an overall speedup between 59% and 74%.

[7] presents instructions to calculate the value of a T table entry. Although implementations that use these instructions are fast, the proposed functional unit has a longer critical path, since it performs four SubBytes using a single substitution table.
Moreover, the instruction presented cannot

Application-specific Systems, Architectures and Processors (ASAP'06) 0-7695-2682-9/06 $20.00 © 2006