DESIGN AND SIMULATING A SPECIALIZED EMBEDDED CORES FOR UDP NETWORK INTERFACE PROCESSING

Mohamed Elbeshti, Mike Dixon, Terry Koziniec
School of Engineering and Information Technology, Perth, WA, Australia
m.elbeshti, mike.dixon, t.koziniec@murdoch.edu.au

ABSTRACT
The speed of Ethernet networks has increased to 40-100 Gbps since the release of IEEE P802.3ba. Enhancing protocol processing at the end node is essential to meet the demands of these increased network speeds. This research presents enhanced pre-packet processing for inbound and outbound traffic using a scalable, Network Interface-based, three-pipeline Embedded Processor. The designed Network Interface uses a specialized, cost-effective 760 MHz embedded processor core that can support a wide range of received UDP/IP packets at up to 100 Gbps. A 430 MHz Embedded Processor can be used for the send side. Furthermore, we provide a processing methodology for Large Receive Offload and Large Send Offload that contributes to pre-packet processing and requires fewer headers and data transfers from the network interface.

KEY WORDS
Large (Send-Receive) offload; Embedded Processor cores; UDP/IP; VHDL simulator; Cycle-accurate performance evaluations; Network Interface.

1. Introduction
UDP-based protocols are needed as a short-term solution to make effective use of kernel-space transport protocols on high-speed networks [23]. Protocol processing for incoming and outgoing packets consumes a large share of the host CPU's cycles [12, 13]. Shifting part of the protocol processing to the network interface (NI) is commonly used to reduce the processing requirements at the host CPU [14]. However, offloading protocol functions to the NI moves the network bottleneck towards the core engine in the NI. Enhancing NI performance and reducing protocol processing time have therefore become priorities in satisfying the demand for high-speed networks.
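To give a rough sense of why the NI core becomes the bottleneck, the per-packet cycle budget can be estimated from the line rate and the frame size. The sketch below is back-of-the-envelope arithmetic only, not a result from this paper: it assumes standard Ethernet framing overhead (header, FCS, preamble and inter-frame gap) and a single core that touches every frame, and the function names are hypothetical.

```python
# Back-of-the-envelope cycle budget per packet for a single NI core.
# Assumptions (not from the paper): standard Ethernet framing overhead,
# and one core that must handle every arriving frame.

ETH_HEADER = 14    # destination MAC, source MAC, EtherType
ETH_FCS = 4        # frame check sequence
ETH_PREAMBLE = 8   # preamble plus start-of-frame delimiter
ETH_IFG = 12       # minimum inter-frame gap


def wire_bytes(payload_bytes: int) -> int:
    """Bytes occupied on the wire by one frame carrying `payload_bytes`."""
    return payload_bytes + ETH_HEADER + ETH_FCS + ETH_PREAMBLE + ETH_IFG


def packets_per_second(line_rate_bps: float, payload_bytes: int) -> float:
    """Maximum frame rate at `line_rate_bps` for a fixed payload size."""
    return line_rate_bps / (wire_bytes(payload_bytes) * 8)


def cycles_per_packet(clock_hz: float, line_rate_bps: float,
                      payload_bytes: int) -> float:
    """Clock cycles available per frame if the core keeps up with line rate."""
    return clock_hz / packets_per_second(line_rate_bps, payload_bytes)


if __name__ == "__main__":
    # 46 B is the minimum Ethernet payload, 1500 B the usual MTU.
    for payload in (46, 1500):
        pps = packets_per_second(100e9, payload)
        budget = cycles_per_packet(760e6, 100e9, payload)
        print(f"payload={payload:5d} B  rate={pps / 1e6:7.2f} Mpps  "
              f"budget={budget:6.1f} cycles/packet")
```

Under these assumptions the budget shrinks from roughly 93 cycles per full-size frame to about 5 cycles per minimum-size frame, which illustrates why reducing the number of packets and headers the core must process matters as much as raising its clock rate.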
This paper presents a novel technique for the Large Receive Offload (LRO) function [17, 18] in the NI. The approach amalgamates incoming UDP/IP packets that share the same Internet Protocol (IP) addresses, port IDs (PID) and Identification into a single large packet inside the NI buffer before passing that packet to the protocol stack for further processing. The LRO has also been extended to manage out-of-order packets. Another contribution is the enhancement of the Large Send Offload (LSO) methodology to split UDP/IP messages into Maximum Transmission Unit (MTU)-sized fragments.

Raising the core processor clock to 5 GHz [2] or using multiple cores in the NI [4] have been proposed as solutions for 10 Gbps. Today, many cost-effective embedded cores are available and can be ported to the NI (e.g. Intel IXP800 and EZChip’s NP-1-4 processor). However, these processors rely on multi-processing to meet the requirements of 100 Gbps processing. In addition, off-the-shelf processors are not optimized for LRO and LSO functions. Because they are designed to support other general functions, their control units must handle general functions, complex instructions, and long and variable execution times. These general-purpose CPUs also have a large number of registers to accommodate all possible uses.

The goal of this research is to design a Network Interface that supports our algorithms for LRO and LSO on high-speed networks of up to 100 Gbps. In addition, we investigate the use of two specialized cores, one for LRO and the other for LSO, that require lower clock frequencies to support 40 and 100 Gbps networks.

2. Large Receive Offload Enhancements
Large Receive Offload (LRO) was designed as a software driver for the Linux platform. Intel has used the virtual LRO to reduce the number of arriving UDP/IP packets.
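The merge rule stated in the Introduction (coalesce packets that share IP addresses, port IDs and Identification, and tolerate out-of-order arrival) can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the authors' VHDL design: the `Fragment`, `Session` and `LroBuffer` names are hypothetical, and every fragment is assumed to already carry the flow fields (in real IP fragmentation only the first fragment holds the UDP header, so the ports would have to be remembered per Identification).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Flow key: source IP, destination IP, source port, destination port, IP Identification.
FlowKey = Tuple[str, str, int, int, int]


@dataclass
class Fragment:
    """Hypothetical, simplified view of one received UDP/IP packet."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    ident: int       # IP Identification field
    offset: int      # byte offset of this payload within the large packet
    more: bool       # False only for the last fragment
    payload: bytes

    @property
    def key(self) -> FlowKey:
        return (self.src_ip, self.dst_ip, self.src_port, self.dst_port, self.ident)


@dataclass
class Session:
    parts: Dict[int, bytes] = field(default_factory=dict)  # offset -> payload
    total: Optional[int] = None   # total length, known once the last fragment arrives


class LroBuffer:
    """Toy NI buffer that coalesces fragments, tolerating out-of-order arrival."""

    def __init__(self) -> None:
        self.sessions: Dict[FlowKey, Session] = {}

    def receive(self, frag: Fragment) -> Optional[bytes]:
        """Store one fragment; return the merged payload once it is complete."""
        session = self.sessions.setdefault(frag.key, Session())
        session.parts[frag.offset] = frag.payload
        if not frag.more:
            session.total = frag.offset + len(frag.payload)
        if session.total is None:
            return None           # last fragment not seen yet
        merged = bytearray()
        for off in sorted(session.parts):
            if off != len(merged):
                return None       # gap (or overlap): keep waiting
            merged += session.parts[off]
        if len(merged) != session.total:
            return None
        del self.sessions[frag.key]   # session complete: release the buffer
        return bytes(merged)


if __name__ == "__main__":
    data = bytes(range(256)) * 20     # 5120-byte message
    step = 1480                       # assumed payload bytes per fragment
    frags = [Fragment("10.0.0.1", "10.0.0.2", 5000, 6000, 7, off,
                      off + step < len(data), data[off:off + step])
             for off in range(0, len(data), step)]
    buf = LroBuffer()
    for i in (2, 0, 3, 1):            # deliberately out of order
        merged = buf.receive(frags[i])
    print(merged == data)
```

Keying the session on the full flow tuple keeps fragments from different streams apart, and indexing the stored payloads by offset lets late fragments fill gaps, so only one large packet (and one header) needs to be handed to the host once the session completes.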
The virtual LRO combines packets from the same stream into larger packets inside a host-memory socket buffer (SKB), generating an SKB only for the first packet of an LRO session. The virtual LRO does not support out-of-order packet processing; packets that do not match the LRO requirements are instead stored in separate SKBs. This approach benefits the receiving side, but the host CPU spends a number of cycles [3] running the virtual LRO. In addition, the host CPU must still process the small packets that do not match the LRO's criteria, such as out-of-order packets. Packet reordering is quite common in networks [19], and the host CPU is required to handle all of the out-of-order packets. With virtual LRO, DMA initiations are required to pass packets from the network interface's buffer across the system bus to user space in host memory. Each 32 KB leads to several DMA initiations,

Proceedings of the IASTED International Conference Modelling and Simulation (MS 2013), July 17 - 19, 2013, Banff, Canada. DOI: 10.2316/P.2013.802-040