Managing Heterogeneous Resources in Data Mining Applications on Grids Using XML-Based Metadata

Carlo Mastroianni 1, Domenico Talia 2, Paolo Trunfio 2

1 ICAR-CNR, Via P. Bucci, cubo 41-c, 87036 Rende (CS), Italy
mastroianni@icar.cnr.it

2 DEIS, Università della Calabria, Via P. Bucci, cubo 41-c, 87036 Rende (CS), Italy
talia@deis.unical.it, trunfio@deis.unical.it

Abstract

The Grid supports the sharing and coordinated use of resources in dynamic heterogeneous distributed environments. The effective use of a Grid requires an approach to managing the heterogeneity of the involved resources, which can include computers, data, network facilities and software tools provided by different organizations. This issue becomes more important when complex applications, such as data-intensive simulations and data mining applications, are executed on a Grid. This paper is concerned with heterogeneous resource management in Grid-based data mining applications. It discusses how resources are represented and managed in the KNOWLEDGE GRID, how XML-based metadata are used to describe data mining tools, data sources, mining models and execution plans, and how those metadata are used for the design and execution of distributed data mining applications on Grids.

1. Introduction

The Grid infrastructure supports the sharing and coordinated use of resources in dynamic geographically distributed environments. The effective use of a Grid requires an approach to managing the heterogeneity of the involved resources, which can include computers, data, network facilities and software tools provided by different organizations. Heterogeneity arises mainly from the large variety of resources within each category. For instance, software can run only on some particular host machines, whereas data can be extracted from different data management systems such as relational databases, semi-structured databases, plain files, etc.
The management of such heterogeneous resources requires the use of metadata, whose purpose is to provide information about the features of resources and their effective use. A Grid user needs to know which resources are available, where resources can be found, how resources can be accessed and when resources are available. Metadata can provide answers about involved computing resources such as data repositories (e.g., databases, file systems, web sites), machines, networks, programs, documents, user agents, etc. Therefore, metadata can be a key element of effective resource discovery and utilization on the Grid. The role of metadata in resource management on Grids is increasingly important as Grid applications become more complex. Thus, Grids need mechanisms and models that define rich metadata schemas able to represent the variety of involved resources.

This paper is concerned with heterogeneous resource management in Grid-based data mining applications. That is, it addresses the problems of locating and allocating computational, data and information resources, and other activities required to use data mining resources in a knowledge discovery process on Grids. In particular, the paper discusses an XML-based approach for managing heterogeneous resources in the KNOWLEDGE GRID environment [1]. The KNOWLEDGE GRID architecture uses the basic Grid services and defines a set of additional layers to implement the services of distributed knowledge discovery on computers connected worldwide, where each node can be a sequential or a parallel machine.

The KNOWLEDGE GRID architecture (see Figure 1) is designed on top of mechanisms provided by Grid environments such as Globus [2]. The KNOWLEDGE GRID uses basic Grid services such as communication, authentication, information, and resource management to build more specific parallel and distributed knowledge discovery (PDKD) tools and services.
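To make the role of metadata more concrete, the following is a minimal sketch of how XML descriptions could answer the which, where, how and when questions raised above. It is not the KNOWLEDGE GRID schema: the element and attribute names (resource, host, access, availability), the host names, the protocol and the helper find_resources are all assumptions introduced only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata catalog. Element names, attributes, hosts and
# paths are illustrative assumptions, not the KNOWLEDGE GRID schema.
CATALOG = """
<resources>
  <resource kind="software" name="clusterer">
    <host>node1.example.org</host>
    <access protocol="ftp" path="/opt/tools/clusterer"/>
    <availability from="00:00" to="23:59"/>
  </resource>
  <resource kind="data" name="sales2002">
    <host>node2.example.org</host>
    <access protocol="ftp" path="/data/sales2002.csv"/>
    <availability from="08:00" to="20:00"/>
  </resource>
</resources>
"""


def find_resources(catalog_xml, kind):
    """Return which resources of a given kind exist, where they are,
    how to reach them and when they are available."""
    root = ET.fromstring(catalog_xml)
    found = []
    # ElementTree supports simple attribute predicates in paths.
    for res in root.findall(f"resource[@kind='{kind}']"):
        host = res.findtext("host")
        access = res.find("access")
        avail = res.find("availability")
        found.append({
            "name": res.get("name"),                      # which
            "host": host,                                 # where
            "url": f"{access.get('protocol')}://{host}"   # how
                   f"{access.get('path')}",
            "available": (avail.get("from"), avail.get("to")),  # when
        })
    return found


if __name__ == "__main__":
    for entry in find_resources(CATALOG, "data"):
        print(entry)
```

Under these assumptions, a single query over the catalog returns one record per matching resource, so a user or a planning tool does not need to know in advance which node hosts a given dataset or program.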
The KNOWLEDGE GRID services are organized into two layers: the Core K-Grid layer, which is built on top of generic Grid services, and the High level K-Grid layer,