A Case for Data Caching in Network Processors

Jayaram Mudigonda†, Harrick M. Vin†          Raj Yavatkar‡
†Laboratory for Advanced Systems Research    ‡Network Processor Division
University of Texas at Austin                Intel
{jram,vin}@cs.utexas.edu                     raj.yavatkar@intel.com

Abstract: Today's network processors (NPs) support mechanisms to hide long memory access latencies; however, they often do not support data caches that are effective in reducing average memory access latency. In this paper, we study a wide range of packet processing applications and demonstrate that accesses to many data structures used in these applications exhibit considerable temporal locality; further, these accesses constitute a significant fraction of the total number of memory accesses made while processing a packet. Consequently, utilizing a cache for these data structures can (1) speed up packet processing, and (2) reduce the total off-chip memory bandwidth requirement considerably.

1 Introduction

Packet processing systems are designed to process network packets efficiently. The design of these systems is governed by two trends. First, over the past decade, the link bandwidth supported by these systems has doubled every year. Second, the diversity and complexity of applications supported by these systems have increased dramatically. Today, packet processing systems support a wide range of header-processing applications such as network address translation (NAT) [20], protocol conversion (e.g., IPv4/v6 inter-operation gateways) [44], and firewalls [3], as well as payload-processing applications such as Secure Socket Layer (SSL) [22], intrusion detection [5], content-based load balancing, and virus scanning [5]. To meet the demands of high performance and flexibility simultaneously, a new breed of programmable processors—referred to as network processors (NPs)—has emerged [18]. To achieve high packet throughput, NPs support several architectural features.
For instance, most NPs support multiple processor cores to process packets concurrently. Further, each core supports hardware multi-threading; this enables NPs to hide memory access latencies by switching context from one hardware thread to another on memory accesses. However, unlike conventional general-purpose processors, which rely extensively on caching to reduce average memory access latencies, NPs often do not support data caching; they expose the memory hierarchy to the programmers and expect programmers to map data structures to different levels of the memory hierarchy explicitly.

In this paper, we make a two-fold argument for supporting data caching in network processors.

Data caching can be beneficial: The lack of caching in NPs is often attributed to the hypothesis that packet processing systems must be configured with sufficient resources to meet worst-case traffic demands; since caching can only improve the average case (but not the worst case), caches are not beneficial. We first argue that this hypothesis is false.

We observe that, today, 93.6% of NP-based systems are used in edge and enterprise networks [4]. These systems support complex applications—such as enterprise firewalls, virus scanning, and storage-over-IP—that involve multiple packet types with different processing requirements. Further, the arrival rate of each packet type varies widely over time. Hence, provisioning such systems to meet the worst-case processing demands of all packet types is often prohibitively expensive. Consequently, these systems are routinely provisioned to service only an expected mix of complex packet types, while ensuring that the worst-case processing requirements are met only for the basic IP-forwarding benchmark [13]. In such systems, data caching—if effective—can reduce the average time required to process each packet (which, in turn, can reduce the resource provisioning level).
Further, by improving the efficiency with which system resources (e.g., memory access bandwidth, hardware threads) are utilized, caching enables a packet processing system to accommodate transient deviations from the expected traffic mix—thereby leading to system designs that are robust to traffic fluctuations.

Data caching is effective: Much of the prior research illustrating the benefits of caching in the context of Internet applications has focused only on basic IP forwarding [13]. Further, much of this work exploits the frequent re-occurrence of IP addresses in packet traces by caching the result of the IP address lookup (as opposed to caching the data structures used to perform the lookup) [17, 21, 30, 36]. In this paper, we analyze the locality properties exhibited by a wide range of data structures used in modern packet processing applications, and demonstrate that data caching can be highly effective.

We argue that packet processing applications access two types of data structures: packet data structures and application data structures. Packet data structures (which include the packet header, payload, and packet meta-data) exhibit considerable spatial locality, but little temporal locality. On the other hand, application data structures (e.g., a trie used for route lookup, a hash table used for classifying packets as belonging to flows, a pattern table used by a virus scanner, etc.) exhibit considerable temporal locality. We demonstrate that accesses to application data structures constitute a significant percentage of the non-stack memory accesses made while processing each packet. Consequently, utilizing a cache for application data structures is highly effective.
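The distinction between the two access patterns can be illustrated with a small simulation. The sketch below is our own illustration, not taken from the paper: it feeds a simple LRU cache two synthetic access streams, one in which every packet touches fresh per-packet addresses (no reuse across packets), and one in which packets repeatedly visit a skewed subset of shared application-data locations (standing in for, e.g., the upper levels of a route-lookup trie). The cache size, trie size, and Pareto skew parameter are illustrative assumptions.

```python
# Hypothetical sketch: contrasts cache behavior of per-packet data
# (unique addresses, no temporal locality) with shared application data
# (skewed reuse across packets) under a simple LRU cache.
from collections import OrderedDict
import random

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()   # ordered by recency of use
        self.hits = self.accesses = 0

    def access(self, addr):
        self.accesses += 1
        if addr in self.lines:
            self.hits += 1
            self.lines.move_to_end(addr)          # mark as most recent
        else:
            self.lines[addr] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)    # evict least recent

    def hit_rate(self):
        return self.hits / self.accesses

random.seed(0)
pkt_cache, app_cache = LRUCache(64), LRUCache(64)
TRIE_NODES = 4096  # assumed size of a shared application data structure

for pkt in range(10_000):
    # Packet data: addresses are unique to each packet, so every
    # access is a compulsory miss.
    for word in range(8):
        pkt_cache.access(('pkt', pkt, word))
    # Application data: heavy-tailed (Pareto) reuse concentrates
    # accesses on a few hot nodes, yielding high temporal locality.
    for _ in range(8):
        app_cache.access(('trie', int(random.paretovariate(1.2)) % TRIE_NODES))

print(f"packet-data hit rate:      {pkt_cache.hit_rate():.2f}")
print(f"application-data hit rate: {app_cache.hit_rate():.2f}")
```

Under these assumptions the packet-data stream hits in the cache essentially never, while the application-data stream hits well over 80% of the time even with a cache far smaller than the structure itself, which is the qualitative effect the paper's measurements quantify on real applications and traces.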
We demonstrate, using two packet processing applications—a Quality of Service (QoS) router and a virus scanner—and several representative packet traces collected from the Internet [35], that a moderate-size cache (40KB for the applications and traces considered) can achieve a high (80%) hit rate for application data structures in both single- and multi-threaded processor environments. We also demonstrate that: (1) the techniques used in today's NPs for mapping data structures to different levels of a memory hierarchy require a significantly larger amount of on-chip fast memory to match the performance of a system that uses data caching (e.g., for the virus scanner application and the traces we consider, a mapping scheme requires over 500KB of on-chip fast memory to match the performance of a system with a 5KB cache); and (2) for the same amount of chip area, a system with data caches outperforms solutions that cache only the results of IP address lookups [17, 21, 30, 36]. Finally, we show that by using moderate-size data caches, one can reduce packet processing times by 30-70% (depending on the miss penalty), and reduce the off-chip memory bandwidth requirement by 80-90%.

The rest of the paper is organized as follows. In Section 2, we argue that support for data caching in network processors can be beneficial. We demonstrate the effectiveness of data caching and quan-