A comparative study of different heavy tail index estimators of the flow size from sampled data

Patrick Loiseau Patrick.Loiseau@ens-lyon.fr
Paulo Gonçalves Paulo.Goncalves@ens-lyon.fr
Pascale Primet Vicat-Blanc Pascale.Primet@ens-lyon.fr
INRIA UNIVERSITÉ DE LYON

ABSTRACT
In this article, we address the problem of estimating the tail parameter of a flow size distribution from sampled packet traffic. Based on synthetic data, we perform a systematic comparison of several estimators proposed in the literature. In the process, we propose a variant of an existing method that takes into account some prior statistical knowledge about the expected distribution. This adapted estimator shows significantly improved performance compared to the others.

Categories and Subject Descriptors
C.2.3 [Computer-Communication Networks]: Network Operations - Network monitoring; G.3 [Probability and Statistics]

General Terms
Measurement, Theory

Keywords
Heavy Tail Distribution, Packet Sampling, Estimation, Grid

1. INTRODUCTION

1.1 Motivation
Grids are distributed systems based on shared computational, storage and visualisation resources interconnected by long distance networks. Compared to clusters, they introduce new scales in terms of heterogeneity and number of cooperating devices, size of the user community, number of interdependent processes, processing capacities, bandwidth, etc. The large distances between computation entities lead to large delays which, together with the increasing possibility of losing packets, turn communication performance into a major challenge in grid networks. For a decade, research on grid performance has essentially been based on internet transport protocols such as TCP or UDP. However, the particular topology of grid networks and the specificity of grid applications make the grid context very different from that of the internet.
For example, our grid testbed, Grid5000, an experimental grid platform gathering a total of 5000 CPUs, is based on a dumbbell core topology interconnecting 9 geographically distributed sites in France with very high speed access links. The access rate of computing nodes in such an environment is 1 Gb/s, and some resources are interconnected by 10 Gb/s links. In grid networks, the aggregation level is often quite low. This raises several questions: are these internet protocols suited to grid applications? Do they guarantee optimal Quality of Service (QoS) and security? How do the different parameters of these protocols influence the performance of specific applications? To answer these questions, traffic characteristics have to be studied in the grid context.

Traffic characteristics have been studied in the internet for a decade. Long-range dependence (LRD) and self-similarity have been observed in internet traces. Typical flow characteristics, such as heavy tails, have also been observed. Since then, traffic modelling has become a very active field of research. Several theoretical and empirical results on traffic characterisation have been established [14, 10]. But these results are mainly based on the traffic observed in the internet. Are they still valid in the grid context? To what extent can they be adapted to grids?

Many methods for traffic characterisation (see [1] and references therein, [9, 5, 8, 13]) are based on the observation of the entire traffic (i.e. every packet is captured). However, such methods are very challenging to apply in very high speed networks because of memory and CPU consumption. For example, in the worst case of a stream of 64-byte packets reaching the maximal bandwidth of 10 Gb/s, the time available to process a packet is about 50 ns. Moreover, storing a 56-byte header for each packet would require a storage rate of more than 1 GB/s. It is then necessary to sample, i.e.
to pick only a sub-sample of the packets going through the link. When only a sub-sample of the traffic is observed, estimating the tail parameter α of the flow size distribution and the LRD parameter H becomes harder. The major question addressed in this paper is the estimation of the tail parameter α from sampled traffic.
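
The worst-case figures quoted above (about 50 ns per packet and more than 1 GB/s of header storage on a 10 Gb/s link) can be checked with a short calculation. The following Python sketch is only illustrative: the function name `per_packet_budget` is hypothetical, and it assumes back-to-back 64-byte packets with no framing overhead such as preamble or inter-frame gap, which the text does not discuss.

```python
def per_packet_budget(link_rate_bps: float, packet_bytes: int, header_bytes: int):
    """Return (seconds available per packet, packets per second,
    header bytes to store per second) for a fully loaded link.

    Assumption: packets are sent back to back and framing overhead
    (preamble, inter-frame gap) is ignored.
    """
    bits_per_packet = packet_bytes * 8
    packets_per_second = link_rate_bps / bits_per_packet
    time_per_packet_s = 1.0 / packets_per_second
    header_rate_bytes_per_s = packets_per_second * header_bytes
    return time_per_packet_s, packets_per_second, header_rate_bytes_per_s


# Worst case from the text: 64-byte packets at 10 Gb/s, 56-byte headers kept.
t, pps, rate = per_packet_budget(10e9, 64, 56)
print(f"time per packet:     {t * 1e9:.1f} ns")
print(f"packet rate:         {pps / 1e6:.2f} Mpackets/s")
print(f"header storage rate: {rate / 1e9:.2f} GB/s")
```

Under these assumptions the per-packet budget is 51.2 ns and the header storage rate is roughly 1.09 GB/s, consistent with the orders of magnitude given in the text.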