Incorporating Data Movement into Grid Task Scheduling

Xiaoshan He 1, Xian-He Sun 1

1 Department of Computer Science, Illinois Institute of Technology, Chicago, Illinois 60616, USA
{hexiaos, sun}@iit.edu

Abstract. Task scheduling is a critical design issue in distributed computing. The emerging Grid computing infrastructure consists of heterogeneous resources in widely distributed autonomous domains, which makes task scheduling even more challenging. The Grid treats both static, immovable hardware and movable, replicable data as computing resources. While intensive research has been done on task scheduling for hardware computing resources and on data replication protocols, how to incorporate data movement seamlessly into task scheduling remains an open question. We consider data movement as a dimension of task scheduling. A dynamic data structure, the Data Distance Table (DDT), is proposed to provide real-time data distribution and communication information. Based on the DDT, a data-conscious task scheduling heuristic is introduced to minimize the data access delay. A simulated Grid environment is set up to test the efficiency of the newly proposed algorithm. Experimental results show that, for data-intensive tasks, the dynamic data-conscious scheduling significantly outperforms the conventional Min-Min heuristic.

1 Introduction

Grid computing provides seamless access to immense network resources, such as high-performance computers and networks, or otherwise unavailable data files. These widely available network resources, however, are geographically distributed, heterogeneous, and under autonomous administrative domains. Task scheduling is a vital issue in Grid computing, yet many technical challenges must be addressed before an efficient task scheduling strategy can be developed. In this study, we propose incorporating data movement and replication into task scheduling.
Consequently, a lightweight, dynamically adjustable strategy for integrating data movement delay with task execution scheduling is introduced. A scheduling heuristic, which treats data as one dimension of the quality of service, is derived to address data-conscious task scheduling in Grid computing.

Intensive research has been conducted on parallel and distributed task scheduling [1, 2, 3]. Task scheduling can be classified as parallel scheduling, where the tasks may come from the same application and have inherent dependence relations, and metatask scheduling, where the tasks are independent of each other [4, 5, 6]. Current Grid scheduling research has focused on metatask scheduling, and we focus on it in this study as well. Metatask scheduling has two modes: online mode and batch mode. Online mode schedules a task upon its arrival