Fast and Accurate Sample Transfers for Real-Time Throughput Optimization

Hemanta Sapkota and Engin Arslan
Computer Science and Engineering, University of Nevada, Reno
Email: hsapkota@nevada.unr.edu, earslan@unr.edu

Abstract: Real-time techniques to optimize data transfer throughput offer promising solutions, as they can discover optimal configuration settings at runtime without requiring upfront work or making assumptions about the underlying system architecture. However, current real-time solutions converge slowly because they need to run many sample transfers to evaluate various settings before determining the optimal one. In this work, we propose a mathematical model to estimate sample transfer throughput quickly, shortening the search time of real-time solutions and increasing the overall gain. Preliminary results show that our model can estimate sample transfer throughput in as little as 3 seconds without degrading estimation accuracy.

I. INTRODUCTION

Large scientific experiments such as environmental and coastal hazard prediction [1], climate modeling [2], and high-energy physics simulations [3], [4] generate data volumes reaching petabytes per year. This data is often moved to remote sites for purposes such as processing, collaboration, and archival. Among previous works on transfer optimization in high-speed networks, real-time solutions [5], [6] offer promising results, as they can adapt to changing network conditions by running sample transfers at runtime. Hence, the benefit of real-time optimization solutions relies heavily on the accuracy and duration of sample transfers. However, the shared nature of network and end-system resources causes significant fluctuations in transfer throughput, hindering fast and accurate estimation of the average sample transfer throughput. Current approaches to running sample transfers include fixed data size [7], fixed time duration [8], and the adaptive approach [5].
Among them, the adaptive approach promises fast convergence with the highest accuracy; however, we found that it can fail to converge when transfer throughput fluctuates heavily. The adaptive approach initiates the transfer of the whole dataset and monitors instantaneous throughput periodically (e.g., every second). It assumes that convergence is reached when the ratio between the throughputs of two consecutive intervals falls within a defined threshold.

In this project, we aim to derive a model for instantaneous transfer throughput in order to estimate average throughput expeditiously and accurately. To this end, we developed a predictive model that relates throughput to transfer time, so that we can predict future throughput values and identify the convergence time and value.

II. MODEL AND PRELIMINARY RESULTS

The model relates transfer time to transfer throughput as shown in Equation 1. In the model, y is the estimated throughput, t is the time since the start of the transfer (in seconds), a and b are coefficients, and n is a constant. We observed that n = 2 works well in our experiments. Nonlinear least-squares analysis is used to calculate the values of a and b. Instead of deriving one equation for all networks and transfers from historical data, we solve the equation (i.e., find the values of a and b) at runtime for each transfer, since transfers exhibit unique behavior depending on network settings and dataset characteristics.

$$ y = a + b\,t^{-n} \tag{1} $$

We evaluated the model by running transfers between three supercomputers (Bridges at PSC, Stampede2 at TACC, and Comet at SDSC) using GridFTP, and compared it against the adaptive sampling approach. We transferred a 7 GB file between the three sites 4,200 times with two parameter configurations: one with a single TCP stream and the other with 8 TCP streams.

Fig. 1: The proposed model offers over 90% accuracy.

Figure 1 illustrates the accuracy of the two approaches for different execution times.
For example, the model estimated transfer throughput in 3 seconds with 90% accuracy for 8-stream transfers, whereas the adaptive approach takes 15 seconds to achieve similar accuracy. Moreover, the model can achieve up to 96% accuracy, while the adaptive approach is unable to go beyond 80% accuracy even after 20 seconds.

III. CONCLUSION AND FUTURE WORK

In this work, we propose a mathematical model to estimate the convergence throughput of sample transfers in order to alleviate the sampling overhead of real-time transfer optimization. Our preliminary results indicate that the proposed model can estimate sample transfer throughput in less than five seconds, with higher accuracy than the state-of-the-art solution. As future work, we plan to evaluate the model using different datasets with various transfer settings, such as multi-file datasets and concurrent file transfers.
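
To make the per-transfer fitting of Equation 1 concrete, the following is a minimal Python sketch, not the authors' implementation. It relies on one observation: once n is held fixed (n = 2 here, as in the text), the model y = a + b t^{-n} is linear in a and b, so an ordinary least-squares fit on the transformed variable x = t^{-n} is sufficient, and no general nonlinear solver is used. The function names, the sample values, and the `convergence_time` criterion (the time at which the predicted throughput comes within a relative tolerance of the asymptote a) are assumptions introduced only for illustration.

```python
from statistics import fmean


def fit_inverse_power(times, throughputs, n=2.0):
    """Least-squares fit of y = a + b * t**(-n) with n held fixed.

    With n fixed, the model is linear in (a, b), so this is simple
    linear regression of y on x = t**(-n). Returns (a, b).
    """
    if len(times) != len(throughputs) or len(times) < 2:
        raise ValueError("need at least two (t, y) samples of equal length")
    if any(t <= 0 for t in times):
        raise ValueError("times must be positive (seconds since start)")
    xs = [t ** (-n) for t in times]
    x_mean = fmean(xs)
    y_mean = fmean(throughputs)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("samples must be taken at distinct times")
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, throughputs))
    b = sxy / sxx
    a = y_mean - b * x_mean
    return a, b


def predict(a, b, t, n=2.0):
    """Throughput predicted by the fitted model at time t (seconds)."""
    return a + b * t ** (-n)


def convergence_time(a, b, tolerance=0.05, n=2.0):
    """Hypothetical criterion (an assumption, not from the paper):
    earliest t at which |y(t) - a| <= tolerance * |a|.
    Solves |b| * t**(-n) = tolerance * |a| for t.
    """
    if a == 0:
        raise ValueError("relative tolerance is undefined for a == 0")
    if b == 0:
        return 0.0
    return (abs(b) / (tolerance * abs(a))) ** (1.0 / n)


if __name__ == "__main__":
    # Made-up per-second throughput samples from the first 3 seconds.
    sample_times = [1.0, 2.0, 3.0]
    sample_throughputs = [300.0, 750.0, 833.3]
    a, b = fit_inverse_power(sample_times, sample_throughputs)
    print("estimated converged throughput:", a)
    print("predicted throughput at t = 10 s:", predict(a, b, 10.0))
    print("estimated convergence time (s):", convergence_time(a, b))
```

Because t^{-n} tends to zero as t grows, the fitted coefficient a is the throughput that the model predicts the transfer will settle at, which is why a few early samples can yield an estimate of the converged value. A general nonlinear least-squares routine would be needed instead if n were fitted together with a and b.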