Applying Grid Technologies to Distributed Data Mining

A. C. Hume 1, A. D. Lloyd 2,3, T. M. Sloan 1, A. C. Carter 1

1 EPCC, The University of Edinburgh, James Clerk Maxwell Building, Mayfield Road, Edinburgh, EH9 3JZ, UK
2 The University of Edinburgh Management School, The University of Edinburgh, 7 Bristo Square, Edinburgh, EH8 9AL, UK
3 Curtin Business School, Curtin University of Technology, GPO Box U1987, Perth WA 6845, Australia

Abstract

The Grid promises improvements in the effectiveness with which global businesses are managed, if it enables distributed expertise to be efficiently applied to the analysis of distributed data. We report an ESRC-funded collaboration between EPCC in Edinburgh and Curtin University of Technology in Perth, Australia, that is applying public-domain Grid technologies to secure data mining within a commercial environment. We describe this Grid infrastructure and discuss its strengths and weaknesses.

1. Introduction

Data mining projects often require distributed analysts to submit jobs to distributed compute resources that process data from distributed data resources. These requirements, along with others such as secure communications and access control, make data mining an ideal application of Grid technologies.

The INWA project [1] has investigated the suitability of existing grid technologies for secure commercial data mining. This project has been funded under the Pilot Projects in E-Social Science programme [2] of the UK’s Economic and Social Research Council (ESRC). The full title of the project is ‘Informing Business and Regional Policy: Grid-enabled fusion of global data and local knowledge’ but for ease of communication this has been abbreviated to INWA.

The project is a collaboration between various academic and commercial organisations from the UK and Australia.
EPCC [3], the University of Edinburgh Management School (UEMS) [4] and Lancaster University Management School (LUMS) [5] are the academic partners in the UK, with Curtin Business School [6] of the Curtin University of Technology the academic partner in Perth, Australia. The commercial partners are from the UK and Australia.

Financial, telecommunications and property data have been provided by the commercial partners. These partners have also helped formulate the requirements for mining this data, the major requirement being that any data supplied can only be accessed by trusted parties. The data from the UK partners is sited at EPCC, with the Australian data sited at Curtin. Sun Microsystems in Australia provided the project with the compute servers for the data located at Curtin. Such a collaboration between multiple data services in multiple jurisdictions tests acceptance of the Grid, a prerequisite for anyone to adopt this technology.

2. The INWA Grid Infrastructure

The project has designed and implemented a Grid infrastructure using existing, freely available Grid technology. This allows analysts at Edinburgh or Perth to securely submit batch jobs that are run on a compute resource local to the data being processed. The results from the batch jobs are automatically transferred back to the user. The infrastructure also allows analysts to interact with the relational data sources via SQL queries.

To submit batch jobs and to transfer the jobs and their results between local and remote sites, the infrastructure uses Grid Engine V5.3 [7] as the compute resource manager, and Transfer-queue Over Globus (TOG) [8] with Globus Toolkit V2 [9] as the grid middleware. Grid Engine is an open source distributed resource management system that allows the