A Platform for Large-Scale Network Performance Analysis

David Bauer, Garrett Yaun, Chris Carothers, Murat Yuksel, Shivkumar Kalyanaraman
Rensselaer Polytechnic Institute, Troy, NY
bauerd@cs.rpi.edu, yaung@cs.rpi.edu, chrisc@cs.rpi.edu, yuksem@ecse.rpi.edu, shivkuma@ecse.rpi.edu

Abstract— Performance analysis techniques are fundamental to the process of large-scale protocol design and network operations. There has been a tremendous explosion in the variety of tools and platforms available (e.g., ns-2, SSFNet, the Click Router toolkit, Emulab, PlanetLab). However, we still view the sample results obtained from such tools with skepticism because they are isolated (potentially random) and may not be representative of the real world. The first issue (random, isolated results) can be addressed by large-scale experiment design techniques that extract maximum information and confidence from a minimum number of carefully designed experiments. Such techniques can be used to find "good" results quickly, guiding either incremental protocol design or operational parameter tuning. The second issue (representativeness) is thornier and relates to formulating benchmarks. In this paper, we explore the former case, i.e., large-scale experiment design and black-box optimization (i.e., large-dimensional parameter state-space search). We propose a new platform, ROSS.Net, that combines large-scale network simulation with large-scale experiment design and XML interfaces to data sets (e.g., Rocketfuel, CAIDA) and models (traffic, topology, etc.). This is a step toward the broader problem of understanding meta-simulation methodology, and we speculate how these tools could be integrated with testbeds such as Emulab and PlanetLab. Examples of large-scale simulations (routing, TCP, multicast) and experiment design are presented.

Keywords— Network Simulation, Experiment Design, Optimistic Simulation

I. INTRODUCTION

Performance analysis techniques are fundamental to the process of protocol design and network operations [1], [2], [3]. A variety of techniques have been used by researchers in different contexts: analytic models (e.g., TCP models [4], [5], web models [6], self-similar models [7], topology models [8]), simulation platforms (e.g., ns-2 [9], SSFNet [10], GloMoSim [11]), prototyping platforms (e.g., the MIT Click Router toolkit [12], XORP [13]), experimental emulation platforms (e.g., Emulab [14]), real-world overlay deployment platforms (e.g., PlanetLab [15]), and real-world measurement data sets (e.g., CAIDA [16], Rocketfuel [17]).

The high-level motivation behind the use of these tools is simple: to gain varying degrees of qualitative and quantitative understanding of the behavior of the system under test [1], [18]. Specific lower-level objectives include validation of protocol design and performance over a wide range of parameter values (parameter sensitivity), understanding of protocol stability and dynamics, and studying feature interactions between protocols. Broadly, we may summarize the objective as a quest for general, invariant relationships between network parameters and protocol dynamics [1], [2], [18].

This explosion of tools has led to the problem of determining the right subset of tools to use in particular situations and, more importantly, determining the right set of experiments to design so as to maximize the information obtained from such performance analysis. Systematic design of experiments [1], [19] is a well-studied area of statistics and performance analysis offering guidance in this regard. However, a quick survey of papers in the networking field suggests that such systematic techniques (e.g., factorial designs, large-scale search) have not been used in the protocol design or network operations process, except possibly by measurement specialists.
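To make the idea of systematic design of experiments concrete, the following is a minimal sketch of a full factorial (2^k) design, in which every combination of factor levels is enumerated as one experiment. The TCP parameter names and levels here are hypothetical illustrations, not values taken from the paper.

```python
from itertools import product

def full_factorial(factors):
    """Enumerate every combination of factor levels.

    With k factors at 2 levels each, this yields the classic
    2^k full factorial design.
    """
    names = list(factors)
    for levels in product(*(factors[n] for n in names)):
        yield dict(zip(names, levels))

# Hypothetical TCP factors, 3 factors x 2 levels -> 8 experiments.
design = list(full_factorial({
    "init_cwnd":  [1, 4],        # initial congestion window (segments)
    "rto_min_ms": [200, 1000],   # minimum retransmission timeout (ms)
    "sack":       [False, True], # selective acknowledgments enabled?
}))

assert len(design) == 2 ** 3
```

Each dictionary in `design` would then parameterize one simulation run; fractional factorial designs reduce the run count further when full enumeration is too expensive.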
This ad hoc approach to organizing simulation or testbed experiments has worked when we design and examine a small number of features, network scenarios, and parameter settings. However, it is likely to become untenable as we design newer protocols that are rapidly deployed at large scale, or as we confront a combinatorial explosion of feature interactions in large operational internetworks. This point has also been made in a related context by Floyd and Paxson [2], who pinpoint three reasons why the Internet is hard to simulate: scale, heterogeneity, and rapid change. The need for scalable simulation and meta-simulation tools is implicit in Floyd's statement [3]: "...we can't simulate networks of that size (global Internet). And even if we could scale, we would not have the proper tools to interpret the results effectively..."

In this paper, we propose a platform, ROSS.Net, that integrates new large-scale experiment design methods with large-scale simulation tools (considering portability issues from ns-2), and we suggest an XML standardization for interfaces with measurement, modeling, and real-world testbeds.

It is well known that the central problem in scalable uniprocessor network simulation is simply "memory, memory, memory," and we respond with a memory-efficient simulation engine design. On multi-processor platforms, the ideal approach is optimistic parallel simulation: let each processor proceed optimistically without global state synchronization, and roll back only when necessary. A key obstacle to this approach is the large overhead of maintaining checkpoints and performing rollbacks. We have developed a novel "reversible computation" approach that essentially runs the simulation backward in time, avoiding the checkpointing and rollback overheads.
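The essence of reversible computation can be sketched as follows. This is an illustrative toy in Python, not the ROSS C API: each event handler has an exact inverse, so on rollback the engine replays inverse handlers in LIFO order instead of restoring saved state snapshots. The queue model and field names are assumptions for illustration only.

```python
class QueueLP:
    """Toy logical process: tracks the length of a FIFO queue."""

    def __init__(self):
        self.qlen = 0

    def forward(self, event):
        # Apply the event, stashing only the minimal bits needed to
        # invert it (here: whether a departure actually removed a job).
        if event["type"] == "arrival":
            self.qlen += 1
        else:  # departure
            event["had_job"] = self.qlen > 0
            if event["had_job"]:
                self.qlen -= 1

    def reverse(self, event):
        # Exact inverse of forward(); invoked on rollback in place of
        # restoring a checkpointed copy of the state.
        if event["type"] == "arrival":
            self.qlen -= 1
        elif event["had_job"]:
            self.qlen += 1

lp = QueueLP()
e1, e2 = {"type": "arrival"}, {"type": "departure"}
lp.forward(e1)          # qlen: 0 -> 1
lp.forward(e2)          # qlen: 1 -> 0
# Rollback: undo in reverse (LIFO) order, recovering each prior state.
lp.reverse(e2)          # qlen: 0 -> 1
lp.reverse(e1)          # qlen: 1 -> 0
```

The per-event cost is a few bytes of "undo" information rather than a full state copy, which is the source of the memory savings described above.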
In summary, our simulation contributions include: (i) an optimistic parallel simulation engine (called ROSS, for Rensselaer's Optimistic Simulation System) that leverages memory-efficient reversible computation instead of traditional state-saving to support rollback recovery, and (ii) a systematic, memory-efficient methodology for model construction using a combination of library interfaces to key data structures and algorithms.

Beyond mere scaling of simulation platforms, our next need is meta-simulation capability, i.e., large-scale experiment design. Statistical experiment design treats the system under test as a black box that transforms input parameters into output metrics. The goal of experiment design is to maximally characterize the