Parallel Star Join and Cube Operations for Data Warehousing in a Cluster Computer Environment

Amit Rudra and Raj Gopalan

School of Information Systems, Curtin University of Technology, Australia
Email: rudraa@cbs.curtin.edu.au

School of Computing, Curtin University of Technology, Australia
Email: raj@cs.curtin.edu.au

ABSTRACT

Decision-oriented technologies, such as data warehousing and online analytical processing systems, store and handle very large volumes of data and therefore require more efficient ways of dealing with them. Recent advances in parallel computing and high-speed networks using a cluster of PCs or workstations (COWs) offer a low-cost way to scale up performance by parallelising both the data and its processing in the data warehouse. However, there are issues peculiar to clusters that first need to be considered. This paper investigates how the star join and data cube operations can be performed in parallel on a cluster of PCs.

Keywords: cluster computing, cluster of workstations, data warehousing, parallel data warehousing

1. INTRODUCTION

With the availability of increasingly powerful processors and large amounts of memory (RAM), PCs are becoming increasingly viable for solving serious problems. It is no wonder, therefore, that clusters of PCs are becoming serious contenders in the parallel systems market [1]. When a number of PCs are connected by a high-speed LAN and suitable software is installed on them, the resulting cluster offers enormous potential for solving computing problems that demand high performance.

Lately, data warehousing as an application has generated considerable research interest among both academics and industry [3]. If properly implemented and used, it offers great advantages for business and industry.
It also presents significant challenges to the research community because, with gigabytes of data added every week or month, a data warehouse can quickly overflow the available storage space. In this paper, we look at the issues involved in implementing a data warehouse in a cluster computer environment using PCs.

A data warehouse can be defined as an online archive of historical enterprise data that is aimed at enabling the knowledge worker (executive, manager, analyst) to make better and faster decisions [3]. The predominant reason for using a data warehouse is better performance in the management of historical data. While the operational database is designed to handle the day-to-day operations of the organisation, the data warehouse is geared to facilitate the management’s ad hoc decision support queries. Typical applications of the former are termed online transaction processing (OLTP), whereas applications of the latter are known as online analytical processing (OLAP). The functional and performance requirements of OLTP and OLAP are quite different [3].

Characteristically, OLTP applications are the bread and butter of an organisation. The tasks of such applications are structured and repetitive, with short atomic transactions automating clerical data processing tasks. The data need to be up to date, and read/update transactions are typical of such applications. Data warehousing and OLAP applications, on the other hand, are for decision support. Here historical and summarised data is more