J.R. Haritsa, R. Kotagiri, and V. Pudi (Eds.): DASFAA 2008, LNCS 4947, pp. 580–587, 2008.
© Springer-Verlag Berlin Heidelberg 2008
Redundant Array of Inexpensive Nodes for DWS

Jorge Vieira¹, Marco Vieira², Marco Costa¹, and Henrique Madeira²

¹ Critical Software SA, Coimbra, Portugal
{jvieira,mcosta}@criticalsoftware.com
² CISUC, Department of Informatics Engineering, University of Coimbra, Coimbra, Portugal
{mvieira,henrique}@dei.uc.pt
Abstract. The DWS (Data Warehouse Striping) technique is a round-robin data partitioning approach especially designed for distributed data warehousing environments. In DWS the fact tables are distributed over an arbitrary number of low-cost computers and queries are executed in parallel by all the computers, guaranteeing nearly optimal speedup and scale-up. However, the use of a large number of inexpensive nodes increases the risk of node failures that impair the computation of queries. This paper proposes an approach that provides Data Warehouse Striping with the capability of answering queries even in the presence of node failures. The approach is based on the selective replication of data over the cluster nodes, which guarantees full availability when one or more nodes fail. The proposal was evaluated using the new TPC-DS benchmark and the results show that the approach is quite effective.
Keywords: Data warehousing, redundancy, replication, recovery, availability.
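The core idea sketched in the abstract, round-robin striping of fact rows combined with selective replication so that queries survive node failures, can be illustrated with a minimal sketch. The function name, node count, and replica placement below are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of round-robin fact-table striping with selective
# replication. Placement policy (primary + next node) is an assumed
# simplification for illustration only.

def stripe(rows, n_nodes, n_replicas=2):
    """Assign each fact row to a primary node (round-robin) and to
    n_replicas - 1 backup nodes, so every row remains available
    when a node fails."""
    nodes = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        for r in range(n_replicas):
            # Replica r of row i goes to the r-th node after its primary.
            nodes[(i + r) % n_nodes].append(row)
    return nodes

facts = [f"sale_{i}" for i in range(6)]
cluster = stripe(facts, n_nodes=3)
# With 2 replicas on 3 nodes, each row lives on 2 nodes, so the
# failure of any single node leaves all rows reachable.
```

With this placement, losing any one node still leaves a full copy of the fact data spread over the surviving nodes, which is the availability property the paper's selective replication aims for.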
1 Introduction
A data warehouse (DW) is an integrated and centralized repository that offers powerful capabilities for data analysis and manipulation [8]. Data warehouses are nowadays an essential source of strategic information for many enterprises. In fact, as competition among enterprises increases, the availability of tailored information that helps decision makers during decision support processes is of utmost importance.
Data warehouses are repositories that usually contain high volumes of data integrated from different operational sources. The data stored in a DW can thus range from a few hundred gigabytes to dozens of terabytes [7]. Obviously, this scenario raises two important challenges. The first is related to the storage of the data, which requires large and highly available storage devices. The second concerns accessing and processing the data in due time, as the goal is to provide low response times for the decision support queries issued by the users.
In order to properly handle large volumes of data and to perform complex data manipulation operations, enterprises normally use high-performance systems to host their data warehouses. The most common choice is systems that offer massive