An Online Data Access Prediction and Optimization Approach for Distributed Systems

Renato Porfirio Ishii and Rodrigo Fernandes de Mello

Abstract—Current scientific applications produce large amounts of data, and the processing, handling, and analysis of such data require large-scale computing infrastructures such as clusters and grids. In this area, studies aim at improving the performance of data-intensive applications by optimizing data accesses. In order to achieve this goal, distributed storage systems have considered techniques of data replication, migration, distribution, and access parallelism. However, the main drawback of those studies is that they do not take application behavior into account when performing data access optimization. This limitation motivated this paper, which applies strategies to support the online prediction of application behavior in order to optimize data access operations on distributed systems, without requiring any information on past executions. To accomplish this goal, the approach organizes application behaviors as time series and then analyzes and classifies those series according to their properties. Based on those properties, the approach selects modeling techniques to represent the series and perform predictions, which are later used to optimize data access operations. This new approach was implemented and evaluated using the OptorSim simulator, sponsored by the LHC-CERN project and widely employed by the scientific community. Experiments confirm that this new approach reduces application execution time by about 50 percent, especially when handling large amounts of data.

Index Terms—Distributed computing, distributed file system, data access optimization, time series analysis, prediction.

1 INTRODUCTION

Scientific applications tend to produce and handle large amounts of data.
Examples of such applications include the Pan-STARRS project,1 which captures about 2.5 petabytes (PB) of data per year, and the Large Hadron Collider (LHC) project,2 which generates from 50 to 100 PB of data every year. Those applications tend to rely on distributed computing tools to deal with their high-performance and storage requirements. Such tools have great potential for solving a vast class of complex problems; however, they are still limited in terms of manipulating large amounts of data [1].

Those applications rely on cluster and grid environments, which are the main large-scale computing infrastructures currently available. Clusters are composed of a set of workstations or personal computers with similar architecture, connected via a high-speed local area network, in which every station usually executes the same operating system. Grids are organized as federations of computers, in which each federation is a different administrative domain sharing its own resources. This type of infrastructure is characterized by hardware and software heterogeneity [2].

These infrastructures meet the demands imposed by several types of applications; however, there is still room for significant theoretical and practical improvements when dealing with large amounts of data. Those improvements would certainly motivate the adoption of distributed computing in several domains such as high-energy physics, financial modeling, earthquake simulations, and climate modeling [3], [4], [5], [6]. However, there are still great challenges to meet such improvements, which comprise dealing with heterogeneous resources, access latency variations, fault detection and recovery, etc.

All those challenges have been motivating studies in different subjects such as job scheduling, load balancing, communication protocols, high-performance hardware architectures, and data access optimization [4], [7].
This latter subject, i.e., data access optimization, is particularly appealing to the Computer Science area, mainly due to the possibility of providing high performance to data-intensive applications [5], [6], [8].

This possibility has motivated the study of the Data Access Problem (DAP) in distributed computing environments [9]. The complexity of this NP-complete problem is justified by the optimization of data location and consistency, which involves operations such as data distribution, migration, and replication [10]. Data distribution aims at efficiently allocating file chunks on the computing environment, attempting to improve the performance of read-and-write operations. Migration is also considered in order to

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 23, NO. 6, JUNE 2012

. R.P. Ishii is with the Faculty of Computer Science, Federal University of Mato Grosso do Sul-UFMS, Cidade Universitária, Caixa Postal 549, Campo Grande, MS 79070-900, Brazil. E-mail: renato@facom.ufms.br.
. R.F. de Mello is with the Department of Computer Science, Institute of Mathematics and Computer Sciences, University of São Paulo-USP, Avenida Trabalhador são-carlense, 400, Centro, Caixa Postal 668, São Carlos, SP 13560-970, Brazil. E-mail: mello@icmc.usp.br.

Manuscript received 17 May 2010; revised 19 Aug. 2011; accepted 23 Sept. 2011; published online 19 Oct. 2011. Recommended for acceptance by I. Stojmenovic. For information on obtaining reprints of this article, please send e-mail to: tpds@computer.org, and reference IEEECS Log Number TPDS-2010-05-0299. Digital Object Identifier no. 10.1109/TPDS.2011.256.

1. http://pan-starrs.ifa.hawaii.edu/public.
2. http://public.web.cern.ch/public/en/LHC/LHC-en.html.

1045-9219/12/$31.00 © 2012 IEEE. Published by the IEEE Computer Society.
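To make the abstract's idea concrete, the following is a minimal, purely illustrative sketch of online access prediction: the series of recent access offsets is classified by a simple property (here, whether it follows a constant stride), a model is chosen accordingly, and the next access is extrapolated so it could be prefetched. The function names, the property test, and the fallback model are assumptions for illustration only, not the authors' actual classification or modeling techniques.

```python
# Illustrative sketch of time-series-based access prediction.
# All names and model choices here are hypothetical.

def classify_series(series):
    """Crude property check: does the series of access offsets
    follow a constant stride (a linear pattern)?"""
    deltas = [b - a for a, b in zip(series, series[1:])]
    return "linear" if len(set(deltas)) == 1 else "irregular"

def predict_next(series):
    """Select a model according to the detected property and
    predict the next access offset (e.g., for prefetching)."""
    if classify_series(series) == "linear":
        stride = series[1] - series[0]
        return series[-1] + stride   # extrapolate the constant stride
    # Fallback for irregular series: naive persistence model,
    # i.e., repeat the last observed offset.
    return series[-1]

# A strided read pattern over blocks 0, 4, 8, 12:
accesses = [0, 4, 8, 12]
print(predict_next(accesses))  # -> 16, the block to prefetch next
```

In the paper's setting, richer series properties (trend, seasonality, self-similarity) would select correspondingly richer models; this sketch only shows the classify-then-model-then-predict pipeline shape.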