A Novel Fault-tolerant Task Scheduling Algorithm for Computational Grids

Jairam Naik K (a), Satyanarayana N (b)
(a) Research Scholar, Faculty of CSE, JNTU Hyderabad, Andhra Pradesh, India.
(b) Professor, Department of CSE, NGITS Nagole, Hyderabad, Andhra Pradesh, India.
jairam.524@gmail.com, nsn1208@gmail.com

Abstract— A computational grid is a hardware and software infrastructure that provides consistent, dependable, pervasive and inexpensive access to high-end computational capabilities in a multi-institutional virtual organization. Computational grids provide the computing power needed to execute tasks, and scheduling those tasks is an important problem: a good scheduling algorithm is needed to select and assign the best resources for each task. Because grids typically consist of strongly varying and geographically distributed resources, choosing a fault-tolerant computational resource is an important issue. Most fault-tolerant scheduling algorithms select a resource to execute a task based on the response time and a fault indicator. In this paper, a scheduling algorithm is proposed that selects the resource based on a new factor called the Scheduling Success Indicator (SSI). This factor consists of the response time, the success rate and the predicted experience of grid resources. Whenever a grid scheduler has tasks to schedule on grid resources, it uses the Scheduling Success Indicator to generate the scheduling decisions. The main strategy of the proposed fault-tolerant algorithm is to select resources that have the lowest tendency to fail and the most experience in task execution. Extensive simulation experiments are conducted on GridSim, a Java-based discrete-event grid simulation toolkit, to quantify the performance of the proposed algorithm. The experiments show that the proposed algorithm can considerably improve grid performance in terms of throughput, failure tendency and worth.
Keywords— Resources, Computational Grid, Fault-tolerant, Success rate, Simulation, Failure tendency, Resource experience

I. INTRODUCTION

Grid computing has evolved over the past years from theoretical research into an application environment. The proliferation of Internet technology and the availability of powerful computers and high-speed networks as low-cost commodity components are changing the way we do large-scale parallel and distributed computing. Interest in coupling geographically distributed computational resources to solve large-scale problems is also growing, leading to what are popularly called Grid and peer-to-peer (P2P) computing networks. These enable the sharing, selection and aggregation of the computational and data resources required for solving large-scale problems in science, engineering, and commerce.

The complexity of computational grids mainly originates from decentralized management and resource heterogeneity with different security policies. These factors often lead to a higher probability of resource failure than in traditional parallel and distributed systems [3], and to strong variations in grid availability. Also, as applications grow to use more resources for longer periods of time, they will inevitably encounter an increasing number of resource failures [4]. When failures occur, they affect the execution of the tasks assigned to the failed resources. A fault-tolerant service is therefore important in grids. Fault tolerance is the ability of the grid to preserve the delivery of expected services despite the presence of failures within it. The various forms of failure in grid computing systems include resource failure, network failure, and application failure [5]. Providing a fault-tolerant service in a grid environment, while optimizing resource scheduling and task execution, is an important issue.
978-1-4673-2818-0/13/$31.00 ©2013 IEEE

In computational grids, fault management is an important and challenging problem for grid application developers [5]. To detect and resolve faults, grid applications must have fault-tolerant services. These services should enable applications to continue their computations on the resources of the grid, without terminating, in case of failure. These services are also required to satisfy the minimum quality of service (QoS) requirements of applications, such as the