Speeding Up AES By Extending a 32 bit Processor Instruction Set
Guido Marco Bertoni
ST Microelectronics
Agrate Brianza, Italy
bertoni@st.com
Luca Breveglieri
Politecnico di Milano
Milano, Italy
breveglieri@elet.polimi.it
Farina Roberto
CEFRIEL - Politecnico di Milano
Milano, Italy
roberto.farina@cefriel.it
Francesco Regazzoni
ALaRI, University of Lugano
Lugano, Switzerland
regazzoni@alari.ch
Abstract
Speed in encryption and decryption operations matters
more today than in the past, because real-time applications
that rely on cryptography have become widespread.
Many co-processors for cryptography have been studied and
presented in the past, but only a few works have addressed
the enhancement of the instruction set architecture (ISA)
of the embedded processor. This paper presents an extension
of the ISA of a 32-bit processor that aims at speeding up
software implementations of the AES algorithm. After
identifying the most frequently executed and the most
time-consuming sections of the algorithm, we design a set
of dedicated instructions to improve the performance of the
cipher operations. We validate our instruction set extension
by measuring the speedup for different optimized
implementations of AES using an ARM processor simulator,
but the enhancements we propose are general enough to be
applied to almost all 32-bit processors.
1 Introduction
The Advanced Encryption Standard (AES) [8] is becoming
the default choice for secure data communication in
current and future embedded systems, where the performance
of AES is crucial to meeting efficiency constraints.
Although several optimized coprocessor implementations
have been proposed ([6] and [10] are examples), only
recently has work begun to address the enhancement of the
instruction set architecture (ISA) of the embedded
processor.
In this paper, we introduce specific instructions designed
to speed up the AES algorithm that can be applied to almost
all 32-bit processors. These instructions are cost-effective
because they share resources, such as registers and data
path, with the existing general-purpose processor. Two
possible sets of instructions are analyzed: byte-oriented
and word-oriented. SMix and Sbox belong to the first group
and are intended to speed up the round operations and the
substitution table of the key unrolling. The second group
comprises SMixW, SubWord and KSFW: the first two are for
the round operations, while the last is specifically
designed for the key unrolling. Using an ARM simulator, we
estimate that the speedup is between 1.43 and 3.45,
depending on the instruction set used.
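As a rough illustration of the byte-level work that such
instructions target, the following C sketch shows the
standard AES operations that software performs on a single
state byte in a round: the S-box substitution followed by
the MixColumns constant multiplications. This is not the
definition of SMix or Sbox, whose semantics are given later
in the paper; the function names (xtime, gf_mul, sbox,
sub_mix_byte) and the byte order of the packed result are
assumptions made for this example only.

```c
#include <stdint.h>

/* Multiply by 0x02 in GF(2^8) modulo the AES polynomial 0x11b. */
static uint8_t xtime(uint8_t a) {
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
}

/* General multiplication in GF(2^8) (shift-and-add). */
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t r = 0;
    while (b) {
        if (b & 1) r ^= a;
        a = xtime(a);
        b >>= 1;
    }
    return r;
}

/* Rotate an 8-bit value left by n bits (0 < n < 8). */
static uint8_t rotl8(uint8_t v, unsigned n) {
    return (uint8_t)((v << n) | (v >> (8 - n)));
}

/* AES S-box computed on the fly: multiplicative inverse
 * (a^254, with 0 mapped to 0) followed by the affine map.
 * Real software normally uses a 256-byte lookup table. */
static uint8_t sbox(uint8_t a) {
    uint8_t inv = 1;
    for (int i = 0; i < 254; i++) inv = gf_mul(inv, a);
    return (uint8_t)(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^
                     rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63);
}

/* Hypothetical helper: contribution of one state byte to a
 * MixColumns output column, packed as (2*S, S, S, 3*S) with
 * 2*S in the most significant byte (assumed byte order). */
static uint32_t sub_mix_byte(uint8_t x) {
    uint8_t s  = sbox(x);
    uint8_t s2 = xtime(s);          /* 0x02 * S[x] */
    uint8_t s3 = (uint8_t)(s2 ^ s); /* 0x03 * S[x] */
    return ((uint32_t)s2 << 24) | ((uint32_t)s << 16) |
           ((uint32_t)s << 8) | (uint32_t)s3;
}
```

In plain software, four such contributions are rotated and
XORed together to form one output column. That sequence of
lookups, shifts and XORs is the kind of work a dedicated
instruction can shorten.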
The rest of the paper is organized as follows: Section 2
discusses related work. Section 3 summarizes the Rijndael
algorithm. Section 4 presents our proposal for speeding up
AES on a 32-bit processor. Section 5 shows the new assembly
code and presents the performance results we obtained.
Section 6 concludes the paper.
2 Related work
The idea of extending a general-purpose instruction set
architecture to accelerate the performance-critical
operations of AES has been addressed in some previous work.
[1] explores general techniques to improve the performance
of eight popular symmetric-key cipher algorithms. New
instructions are introduced to support fast substitutions,
general permutations, rotations, and modular arithmetic.
Performance analysis of the optimized ciphers shows an
overall speedup between 59% and 74%.
[7] presents instructions to calculate the value of a T
table entry. Although implementations that use these
instructions are fast, the proposed functional unit has a
longer critical path, since it performs four SubBytes using
a single substitution table. Moreover, the instruction
presented cannot
Application-specific Systems, Architectures and Processors (ASAP'06)
0-7695-2682-9/06 $20.00 © 2006