Comparing Analytical Modeling with Simulation for Network Processors: A Case Study

Matthias Gries¹, Chidamber Kulkarni¹, Christian Sauer², Kurt Keutzer¹
¹ University of California, Berkeley
² Infineon Technologies, Corporate Research, Munich
{gries, kulkarni, sauer, keutzer}@eecs.berkeley.edu

Abstract

Programming network processors remains an art because of the wide variety of network processor architectures and the scarce support for reasoning about and exploring implementations on such architectures. We present a case study of mapping an IPv4 forwarding switch application onto the Intel IXP1200 network processor, and we compare this implementation with an analytical model of both the application and the architecture that is used to evaluate different design alternatives. Our results show not only that we can model the IXP1200 and our application to within 15% of the IXP1200 simulation results, but also that the model closely matches the trends observed in simulation across different workloads. This shows the clear potential of such analytical techniques for design space exploration.

1. Introduction

Contemporary network processors (NPs) exhibit a wide range of architectures for performing similar tasks: from simple RISC cores with dedicated peripherals, in pipelined and/or parallel organization, to heterogeneous multiprocessors based on complex multi-threaded cores with customized instruction sets. Evaluating such disparate architectures via extensive benchmarking is time consuming and tedious due to the absence of a proper programming model.

Programming such concurrent systems remains an art. The programmer is required not only to partition the application and balance its load manually, but also to implement each task, often in assembly, in order to obtain reliable performance estimates. Hence, a robust application mapping strategy for such architectures requires a balance between thread partitioning, scheduling, memory accesses and I/O.
With current state-of-the-art tools, this task is time consuming and error prone, because system implementers rely on a trial-and-error method based on simulation runs. Therefore, methods that address these issues need to be investigated.

For the next generation of network processor based system implementations, we strongly believe that considerable emphasis will be put on performance per cost (for example, power consumption) and on support for appropriate programming models. It is therefore essential to investigate methods that help identify limitations and bottlenecks in a system implementation without going all the way down to a complete implementation, as is current practice. Consequently, high-level design space exploration tools are required that support a wide range of heterogeneous architectures and enable precise reasoning about different implementation styles and their performance. Data generated by such tools while evaluating single design points should ideally be indicative of the final achievable quality of results.

Related work focuses on three different approaches, namely simulation, trace analysis and analytical models. Simulation and trace analysis are somewhat similar, since generating a trace requires a (cycle-)accurate simulator. Simulation-based approaches, such as [1], require the application to be specified in a high-level language or in assembly and compiled to the particular architecture. In addition, different workloads need to be specified or generated for both simulation and trace-based analysis. Trace-based analysis (like [2]) is limited by the fact that a trace captures the details of a single execution for a particular workload. Thus, for event-driven systems with varying workloads, a large number of traces must be generated for any useful analysis, and hence the gap between simulation and trace analysis is no longer that large.
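To illustrate why analytical models are cheap to evaluate over many design points, the following is a deliberately simple bottleneck bound for a multithreaded packet engine. It is a hypothetical example, not the model developed in this paper and not the approach of [3], [4] or [5]: the `Workload` parameters and their values are made up, context switches are assumed free, and memory contention is reduced to a single throughput cap. Each thread alternates computation with memory stalls, additional threads hide those stalls up to full engine utilization, and the shared memory unit bounds the packet rate no matter how many engines are added. The default ranges of six engines and four threads mirror the IXP1200's six microengines with four hardware threads each.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Workload:
    """Hypothetical per-packet demands; all values are illustrative, not measured."""
    compute_cycles: float  # cycles of useful work a thread spends per packet
    mem_accesses: int      # off-chip memory references per packet
    mem_latency: float     # cycles a thread stalls waiting for each reference
    mem_service: float     # cycles the shared memory unit is busy per reference


def engine_utilization(threads: int, w: Workload) -> float:
    """Fraction of cycles an engine does useful work when `threads` hide memory stalls.

    Each thread alternates compute_cycles of work with mem_accesses * mem_latency
    cycles of stall; context switches are assumed to cost nothing.
    """
    stall = w.mem_accesses * w.mem_latency
    return min(1.0, threads * w.compute_cycles / (w.compute_cycles + stall))


def throughput(engines: int, threads: int, w: Workload) -> float:
    """Packets per cycle: the smaller of the engine (compute) bound and the memory bound."""
    compute_bound = engines * engine_utilization(threads, w) / w.compute_cycles
    memory_bound = 1.0 / (w.mem_accesses * w.mem_service)
    return min(compute_bound, memory_bound)


def explore(w: Workload, max_engines: int = 6, max_threads: int = 4) -> dict[tuple[int, int], float]:
    """Evaluate every (engines, threads) design point with a few arithmetic operations each."""
    return {
        (e, t): throughput(e, t, w)
        for e, t in product(range(1, max_engines + 1), range(1, max_threads + 1))
    }


if __name__ == "__main__":
    w = Workload(compute_cycles=100, mem_accesses=4, mem_latency=50, mem_service=10)
    points = explore(w)
    best = max(points, key=points.get)  # first design point with the highest throughput
    print(best, points[best])
```

With these made-up numbers, the first design point to reach the highest throughput is three engines with three threads each, at 0.025 packets per cycle; beyond that point the shared memory unit, not the engines, limits throughput. All 24 design points are evaluated with a handful of arithmetic operations each, whereas a simulation-based study would need a separate run per point.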
In contrast to simulation and trace analysis, analytical models promise fast evaluation, which allows a larger design space to be explored. In the packet processing domain, [3] presents an approach to explore different cache configurations based on general-purpose computing elements. Lakshmanamurthy et al. [4] present an ad-hoc approach limited to the IXP2400 network processor. Thiele et al. [5] present a generic approach for modeling applications and architectures in this domain. This model