Implementation and Evaluation of Parallel Data Mining on PC Cluster and Optimization of its Execution Environments

Masato Oguchi, Masaru Kitsuregawa

Abstract — Personal computer/workstation clusters have been studied intensively in the field of parallel and distributed computing. From the viewpoint of applications, data-intensive applications such as data mining and ad hoc query processing in databases are considered very important for high performance computing, alongside conventional scientific calculations. We have built and evaluated PC cluster pilot systems, in particular a SAN-connected PC cluster, and implemented parallel data mining on them. Several optimizations, including dynamic data allocation, are discussed for the execution of this application.

Keywords — PC cluster, Data Mining, Storage Area Network, Optimization, Dynamic data allocation.

I. Introduction

Recently, personal computer/workstation (PC/WS) clusters have become a hot research topic in the field of parallel and distributed computing. They are expected to play an important role as large scale parallel computers in the next generation because of their good scalability and cost/performance ratio. The reasons are as follows. Components of today's high performance parallel computers, e.g. CPUs, disks, and memories, are evolving from proprietary parts into commodity parts. This is because technologies for such commodity parts have matured enough to be used in high-end computer systems. While interconnection networks have not yet been commoditized, some general-purpose networks, e.g. Fast/Gigabit Ethernet and ATM-LAN, are strong candidates for a de facto standard of high speed communication networks. With the progress of high performance networks, future parallel computer systems will undoubtedly employ commodity networks as well.
Given these technological trends, Fast Ethernet and/or ATM-connected PC clusters are considered a promising platform for future high performance parallel computers.

From the viewpoint of applications, we believe that data-intensive applications, such as data mining and ad hoc query processing in databases, will be extremely important for massively parallel processors in the near future. We previously developed a large scale ATM-connected PC cluster consisting of 100 Pentium Pro PCs and implemented several database applications on it, including parallel data mining, to evaluate their performance and the feasibility of such applications on PC clusters [1][2].

M. Oguchi is with the Research and Development Initiative, Chuo University, Tokyo, Japan. E-mail: oguchi@computer.org. M. Oguchi and M. Kitsuregawa are with the Institute of Industrial Science, University of Tokyo, Tokyo, Japan.

Unlike conventional scientific calculations, association rule mining, one of the best known problems in data mining, uses main memory in a peculiar way. It allocates a large number of small data areas in main memory, and the number of those areas grows enormously during execution. Thus, the memory requirement changes dynamically and can become extremely large. The contents of memory must be swapped out if the requirement exceeds the real memory size. However, because each data element is rather small and all the elements are accessed almost at random, swapping out to secondary storage is likely to cause severe performance degradation. We are investigating the feasibility of using available memory in remote nodes as a swap area when application execution nodes need to swap out their memory contents. In this paper, we report experimental results in which nodes executing an application dynamically acquire extra memory from several remote nodes in the ATM-connected PC cluster.
Moreover, a method that uses a distant node's memory with remote update operations, which is expected to prevent thrashing, is proposed and evaluated.

LAN-connected PC clusters are employed as large scale server sites and/or high performance parallel computers. In both cases, huge volumes of data may be transferred frequently from one node's disk to another for parallel computation, load distribution, system maintenance, and so on. A LAN cluster is a shared-nothing system; that is, all nodes of the cluster are connected only by a LAN and no disk is shared among them. Therefore, the LAN may be almost constantly busy with disk data management. The LAN bandwidth of the cluster should not be flooded with this kind of data transfer, because the LAN is also needed for other traffic, such as client-server requests and parallel/distributed computing among nodes.

To reduce LAN traffic and raise the availability of nodes in the cluster, Storage Area Networks (SANs), e.g. Fibre Channel, have come to be adopted [3]. SANs link storage devices directly to all nodes of the cluster and can therefore prevent bandwidth congestion on LANs. In a SAN cluster, unlike a LAN cluster, nodes do not have to communicate with each other through a LAN to read data from other nodes' disks, because a pool of storage is shared among all nodes and can be accessed directly through the SAN without burdening the other nodes or the LAN.

In this paper, we have also built another PC cluster