IEEE COMMUNICATIONS LETTERS, VOL. 8, NO. 3, MARCH 2004 183

Packet Processing Acceleration With a 3-Stage Programmable Pipeline Engine

I. Papaefstathiou, K. Vlachos, N. Nikolaou, N. Zervos, and V. B. Lawrence, Member, IEEE

Abstract: In this letter, we present the architecture and implementation of a novel 3-stage processing engine suitable for deep packet processing in high-speed networks. The engine, which has been fabricated as part of a network processor, comprises a typical RISC core and programmable hardware. To assess the performance of the engine, experiments with packets of various lengths were performed and the results compared against the IXP1200 network processor. The comparison revealed that, for the case study shown in this letter, the proposed packet-processing engine is up to three times faster. Moreover, the engine is simple to fabricate, less expensive than the corresponding hardware cores of the IXP1200, and easy to program for different networking applications.

Index Terms: ASIC, Network Processor, Special Purpose Processor.

I. INTRODUCTION

WITH the rapid growth of Internet traffic and increasing line rates, the execution of the various networking tasks is increasingly considered to be the main bottleneck for communications. To meet the stringent processing demands, designers are faced with two alternatives: either create a custom hardware solution (ASIC) or use a special purpose processor, called a network processor (NP). The ASIC approach can achieve the desired speeds, but it is inflexible, since changes in functionality are very limited or not permitted at all. However, since protocols continue to evolve, accommodating new features that comply with the latest standards is of significant importance. In this respect, NPs can provide the required flexibility and programmability.
In this letter, we present a flexible and programmable engine that can sustain wire-speed protocol processing, even for complex and highly demanding networking tasks. The design can be easily embedded in any networking environment (i.e., both ASICs and NPs). It combines a typical RISC core [1] with custom-made, fully programmable hardware in a 3-stage pipeline module. In this way, the efficiency of a typical CPU is enhanced by providing the means to tailor its circuits for special tasks and, conversely, the application diversity of highly optimized hardware is significantly broadened. Using this engine, which incorporates a low-cost and simple general purpose RISC at 200 MHz, we were able to sustain stateful inspection firewall processing and Network Address Translation (NAT) for 2.5 Gb/s TCP/IP traffic.

Manuscript received June 11, 2003. The associate editor coordinating the review of this letter and approving it for publication was Prof. K. Park.
I. Papaefstathiou, N. Nikolaou, and N. Zervos are with Ellemedia Technologies, Athens GR17121, Greece (e-mail: yanni@ellemedia.com; nikolaou@ellemedia.com; nzervos@ellemedia.com).
K. Vlachos was with Bell Laboratories Advance Technology EMEA, Lucent Technologies, 1200BD Hilversum, The Netherlands.
V. B. Lawrence is with Bell Labs, Lucent Technologies, Holmdel, NJ 07733 USA (e-mail: vbl@lucent.com).
Digital Object Identifier 10.1109/LCOMM.2004.823427

Fig. 1. Programmable processing functional model and block diagram.

II. THE PROGRAMMABLE PROCESSING ENGINE (PPE)

The Programmable Processing Engine (PPE) (see Fig. 1) is a 3-stage pipeline module consisting of three logical sub-units: a Field Extractor (FEX) unit, a typical RISC core, and a Field Modification unit (FMO). More specifically, programmable hardware extracts fields from incoming packets and feeds them to the processing core. After the RISC core has processed the fields, the FMO updates, in a programmable manner, specific fields of the packet.
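The three-stage flow described above (field extraction, processing in the core, field modification) can be illustrated with a minimal software sketch. It is not the PPE's microcode or hardware design: the function names (`fex`, `core_nat`, `fmo`, `process`), the `FIELDS` table, and the NAT table contents are all hypothetical, the stages run one after another rather than overlapping as they would in a hardware pipeline, and checksum updates are omitted. The field offsets assume an IPv4 header without options followed directly by a TCP header.

```python
# Hypothetical field table: name -> (bit offset, bit length).
# Assumes a 20-byte IPv4 header (no options) followed by a TCP header.
FIELDS = {
    "src_ip": (96, 32),     # IPv4 source address, bytes 12..15
    "src_port": (160, 16),  # TCP source port, bytes 20..21
}


def extract_field(packet: bytes, bit_offset: int, bit_length: int) -> int:
    """Read an unsigned big-endian field of bit_length bits at bit_offset."""
    value = int.from_bytes(packet, "big")
    shift = len(packet) * 8 - bit_offset - bit_length
    return (value >> shift) & ((1 << bit_length) - 1)


def modify_field(packet: bytes, bit_offset: int, bit_length: int,
                 new_value: int) -> bytes:
    """Return a copy of packet with one field overwritten by new_value."""
    shift = len(packet) * 8 - bit_offset - bit_length
    mask = ((1 << bit_length) - 1) << shift
    value = int.from_bytes(packet, "big")
    value = (value & ~mask) | ((new_value << shift) & mask)
    return value.to_bytes(len(packet), "big")


def fex(packet: bytes) -> dict:
    """Stage 1 (FEX-like): extract the configured fields from the packet."""
    return {name: extract_field(packet, off, length)
            for name, (off, length) in FIELDS.items()}


def core_nat(fields: dict, nat_table: dict) -> dict:
    """Stage 2 (core-like): look up a source NAT mapping for the fields.

    Unknown (address, port) pairs are passed through unchanged.
    """
    key = (fields["src_ip"], fields["src_port"])
    new_ip, new_port = nat_table.get(key, key)
    return {"src_ip": new_ip, "src_port": new_port}


def fmo(packet: bytes, updates: dict) -> bytes:
    """Stage 3 (FMO-like): write the processed fields back into the packet."""
    for name, value in updates.items():
        off, length = FIELDS[name]
        packet = modify_field(packet, off, length, value)
    return packet


def process(packet: bytes, nat_table: dict) -> bytes:
    """Run one packet through the three stages in order."""
    return fmo(packet, core_nat(fex(packet), nat_table))
```

For example, calling `process(packet, {(0x0A000005, 1234): (0xC0000201, 40000)})` on a packet whose source is 10.0.0.5:1234 would rewrite the source address to 192.0.2.1 and the source port to 40000, leaving all other bytes untouched.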
Additionally, an I/O data controller is used to relieve the processing core of I/O duties and free resources for real processing.

The Field Extraction operation is controlled by microcode stored in an internal SRAM. The instruction set consists of simple, generic instructions that operate on data stored in a FIFO of 32-bit words. The FEX instruction set supports the following operations: 1) variable length (1 to 32 bits) field extraction; 2) backward/forward movement in the data FIFO; 3) conditional jumps; and 4) addition. FEX instructions are flexible enough to allow conditional branches based on the content of an extracted field (e.g., the protocol field of the IP header), as well as parsing of protocol headers based on header and packet length information (e.g., FEX can be easily programmed to recognize and extract or skip IP and TCP options). The execution time of a field extraction operation is constant per field: it depends only on the number of fields extracted, not on the number of bits extracted.

1089-7798/04$20.00 © 2004 IEEE

Packet processing is initiated by a packet arrival at the FEX input interface. After the field extraction, the I/O Data Controller places the extracted fields directly into the register file of the RISC core. In this way the RISC performance is significantly enhanced, as I/O operations are performed in parallel with the