J.R. Haritsa, R. Kotagiri, and V. Pudi (Eds.): DASFAA 2008, LNCS 4947, pp. 580–587, 2008.
© Springer-Verlag Berlin Heidelberg 2008
Redundant Array of Inexpensive Nodes for DWS

Jorge Vieira¹, Marco Vieira², Marco Costa¹, and Henrique Madeira²

¹ Critical Software SA, Coimbra, Portugal
{jvieira,mcosta}@criticalsoftware.com
² CISUC, Department of Informatics Engineering, University of Coimbra, Coimbra, Portugal
{mvieira,henrique}@dei.uc.pt
Abstract. The DWS (Data Warehouse Striping) technique is a round-robin data partitioning approach especially designed for distributed data warehousing environments. In DWS the fact tables are distributed over an arbitrary number of low-cost computers and queries are executed in parallel by all the computers, guaranteeing nearly optimal speedup and scale-up. However, the use of a large number of inexpensive nodes increases the risk of node failures that impair the computation of queries. This paper proposes an approach that provides Data Warehouse Striping with the capability of answering queries even in the presence of node failures. The approach is based on the selective replication of data over the cluster nodes, which guarantees full availability when one or more nodes fail. The proposal was evaluated using the new TPC-DS benchmark and the results show that the approach is quite effective.
Keywords: Data warehousing, redundancy, replication, recovery, availability.
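The core idea sketched in the abstract, round-robin striping of fact rows combined with selective replication so that queries survive node failures, can be illustrated with a minimal sketch. The function name, node count, and replica placement below are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of round-robin fact-table striping with selective
# replication. Placement policy (primary + next node) is an assumed
# simplification for illustration only.

def stripe(rows, n_nodes, n_replicas=2):
    """Assign each fact row to a primary node (round-robin) and to
    n_replicas - 1 backup nodes, so every row remains available
    when a node fails."""
    nodes = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        for r in range(n_replicas):
            # Replica r of row i goes to the r-th node after its primary.
            nodes[(i + r) % n_nodes].append(row)
    return nodes

facts = [f"sale_{i}" for i in range(6)]
cluster = stripe(facts, n_nodes=3)
# With 2 replicas on 3 nodes, each row lives on 2 nodes, so the
# failure of any single node leaves all rows reachable.
```

With this placement, losing any one node still leaves a full copy of the fact data spread over the surviving nodes, which is the availability property the paper's selective replication aims for.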
1 Introduction
A data warehouse (DW) is an integrated and centralized repository that offers powerful capabilities for data analysis and manipulation [8]. Data warehouses are nowadays an essential source of strategic information for many enterprises. In fact, as competition among enterprises increases, the availability of tailored information that helps decision makers during decision support processes is of utmost importance.
Data warehouses are repositories that usually contain high volumes of data integrated from different operational sources. The data stored in a DW can thus range from a few hundred gigabytes to dozens of terabytes [7]. Obviously, this scenario raises two important challenges. The first is related to the storage of the data, which requires large and highly available storage devices. The second concerns accessing and processing the data in due time, as the goal is to provide low response times for the decision support queries issued by the users.
In order to properly handle large volumes of data and to perform complex data manipulation operations, enterprises normally use high-performance systems to host their data warehouses. The most common choice is systems that offer massive