A comparative study of different heavy tail index estimators of the flow size from sampled data

Patrick Loiseau, Patrick.Loiseau@ens-lyon.fr
Paulo Gonçalves, Paulo.Goncalves@ens-lyon.fr
Pascale Primet Vicat-Blanc, Pascale.Primet@ens-lyon.fr
INRIA, UNIVERSITÉ DE LYON

ABSTRACT
In this article, we address the problem of estimating the tail parameter of a flow size distribution from sampled packet traffic. Based on synthetic data, we perform a systematic comparison of several estimators proposed in the literature. Along the way, we propose a variant of an existing method that takes into account some prior statistical knowledge about the expected distribution. This adapted estimator performs significantly better than the others.

Categories and Subject Descriptors
C.2.3 [Computer-Communication Networks]: Network Operations - Network monitoring; G.3 [Probability and Statistics]

General Terms
Measurement, Theory

Keywords
Heavy Tail Distribution, Packet Sampling, Estimation, Grid

1. INTRODUCTION

1.1 Motivation
Grids are distributed systems based on shared computational, storage and visualisation resources interconnected by long distance networks. Compared to clusters, they introduce new scales in terms of heterogeneity, number of cooperating pieces of equipment, size of the user community, number of interdependent processes, processing capacity, bandwidth, etc. The large distances between computing entities, which lead to large delays, together with the increased likelihood of losing packets, make communication performance a major challenge in grid networks. For a decade, research on grid performance has essentially been based on internet transport protocols such as TCP or UDP. However, the particular topology of grid networks and the specificity of grid applications make the grid context very different from that of the internet.
For example, our grid testbed, Grid5000, an experimental grid platform gathering a total of 5000 CPUs, is based on a dumbbell core topology that interconnects 9 sites distributed across France through very high speed access links. The access rate of computing nodes in such an environment is 1 Gb/s. Some resources are interconnected by 10 Gb/s links. In grid networks, the aggregation level is often quite low. This raises a few questions: are these internet protocols suited to grid applications? Do they guarantee optimal Quality of Service (QoS) and security? What is the influence of the different parameters of these protocols on the performance of specific applications? To answer these questions, traffic characteristics have to be studied in the grid context.

Traffic characteristics have been studied in the internet for a decade. Long Range Dependence (LRD) and self-similarity have been observed in internet traces. Typical flow characteristics such as heavy tails have also been observed. Traffic modelling has since become a very active field of research, and a few theoretical and empirical results on traffic characterisation have emerged [14, 10]. But these results are mainly based on traffic observed in the internet. Are they still valid in the grid context? To what extent can they be adapted to grids?

Many traffic characterisation methods (see [1] and references therein, [9, 5, 8, 13]) are based on the observation of the entire traffic (i.e. every packet is captured). However, such methods are very challenging to apply in very high speed networks because of memory and CPU consumption. For example, in the worst case of a stream of 64-byte packets filling the maximal bandwidth of 10 Gb/s, the time available to process a packet is about 50 ns. Moreover, storing a 56-byte header for each packet would require writing more than 1 GB/s. It is then necessary to sample, i.e.
to pick only a sub-sample of the packets going through the link. When only a sub-sample of the traffic is observed, estimating the tail parameter α of the flow size distribution and the LRD parameter H becomes harder. The major question addressed in this paper is the estimation of the tail parameter α from sampled traffic.
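
The per-packet time budget and the header storage rate quoted above for a 10 Gb/s link can be checked with a short back-of-the-envelope computation. The sketch below is only illustrative: the function and variable names are ours, Ethernet framing overhead (preamble, inter-frame gap) is ignored, and 1 GB is taken as 10^9 bytes.

```python
# Back-of-the-envelope check of the worst-case figures quoted in the text.
# Assumptions (not from the paper): framing overhead is ignored and
# 1 GB = 1e9 bytes. All names here are illustrative.

LINK_RATE_BPS = 10e9   # 10 Gb/s link
PACKET_BYTES = 64      # worst case: stream of 64-byte packets
HEADER_BYTES = 56      # header bytes stored for each packet


def per_packet_time_ns(link_rate_bps, packet_bytes):
    """Time available to process one packet at line rate, in nanoseconds."""
    return packet_bytes * 8 / link_rate_bps * 1e9


def header_storage_rate_gb_per_s(link_rate_bps, packet_bytes, header_bytes):
    """Storage rate needed to keep one header per packet, in GB/s."""
    packets_per_second = link_rate_bps / (packet_bytes * 8)
    return packets_per_second * header_bytes / 1e9


budget_ns = per_packet_time_ns(LINK_RATE_BPS, PACKET_BYTES)
storage_gb_s = header_storage_rate_gb_per_s(
    LINK_RATE_BPS, PACKET_BYTES, HEADER_BYTES
)

print(f"time per packet: {budget_ns:.1f} ns")
print(f"header storage rate: {storage_gb_s:.2f} GB/s")
```

Under these assumptions the computation gives roughly 51 ns per packet and about 1.09 GB/s of header data, in line with the "about 50 ns" and "more than 1 GB/s" figures above.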