Database Replication in Large Scale Systems: Optimizing the Number of Replicas

Modou Gueye, UCAD-FST, Dakar, SENEGAL, gmodou@ucad.sn
Idrissa Sarr, UPMC Paris Universitas, LIP6 Lab, FRANCE, idrissa.sarr@lip6.fr
Samba Ndiaye, UCAD-FST, Dakar, SENEGAL, ndiayesa@ucad.sn

ABSTRACT
In distributed systems, replication is used to ensure availability and increase performance. However, the heavy workload of distributed systems such as Web2.0 applications or Global Distribution Systems limits the benefit of replication if its degree (i.e., the number of replicas) is not controlled. Since every replica must eventually perform all updates, there is a point beyond which adding more replicas does not increase the throughput, because every replica is saturated by applying updates. Moreover, if the replication degree exceeds the optimal threshold, the useless replicas generate an overhead due to extra communication messages. In this paper, we propose a replication management solution that reduces the number of useless replicas. To this end, we define two mathematical models which approximate the appropriate number of replicas needed to achieve a given level of performance. Moreover, we demonstrate the feasibility of our replication management model through simulation. The results show the effectiveness and accuracy of our models.

1. INTRODUCTION
New applications such as Web2.0 applications and Global Distribution Systems manage huge amounts of data and deal with heavy workloads. The challenge for these applications is to ensure data availability and consistency while handling fast updates. One solution to this problem is replication. Although replication can be used to ensure either read performance or write performance, improving both read and write performance simultaneously is a more challenging task [4]. To improve read performance alone, master-slave replication is widely used.
With this approach, read-only queries are performed on the slave nodes while update queries are sent to the master node. Conversely, to address both read and write performance, multi-master replication allows each replica to store a full copy of the database, so read and write operations can be handled anywhere. However, some synchronisation is then needed to meet the mutual consistency requirement. To limit this synchronisation, which can lead to aborts and thus slow down system scalability, some solutions use lazy multi-master replication [16, 6] or delegate consistency management to the middleware layer [13, 18, 4]. The heavy workload of Web2.0 applications or Global Distribution Systems limits the benefit of replication if its degree (i.e., the number of replicas) is not controlled. Since every replica must eventually perform all updates, there is a point beyond which adding more replicas does not increase the throughput, because every replica is saturated by applying updates. Moreover, if the replication degree exceeds the optimal threshold, the useless replicas generate an overhead due to extra communication messages.

Many solutions have been proposed in the field of database replication, such as [13, 11, 12]. Some solutions include freshness control, for instance [16, 6, 9, 1, 15, 5]. Others focus on data availability or fault-tolerant services, such as [2, 7, 8, 19]. We base our work on the DTR approach [18], since it offers update-anywhere and freshness control features, and is designed for Global Distribution Systems. DTR proposes a solution that controls the freshness of replicas in order to improve the performance of concurrent updates. Furthermore, DTR availability has been enhanced in [17] by using middleware-based replication. However, none of these previous works attempts to compute the replication threshold that reduces the overhead involved in managing replicas.
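The saturation argument above can be made concrete with a simple back-of-the-envelope model (an illustration under our own assumptions, not the paper's formal model): if every replica has a fixed capacity of C operations per second and must apply the full write stream in addition to its share of reads, then total throughput T with n replicas satisfies w*T + (1-w)*T/n <= C, where w is the write fraction. Solving for T shows a hard ceiling of C/w as n grows:

```python
def max_throughput(n, capacity, write_fraction):
    """Upper bound on total throughput with n replicas.

    Each replica applies all writes (w * T) plus its share of reads
    ((1 - w) * T / n); solving w*T + (1-w)*T/n <= capacity for T
    gives the bound below. Illustrative model, not the paper's.
    """
    w = write_fraction
    return capacity / (w + (1 - w) / n)

# Example: 1000 ops/s per replica, 20% writes. Throughput climbs
# toward the ceiling capacity / w = 5000 ops/s, so beyond a certain
# n extra replicas add almost nothing while still costing messages.
for n in (1, 2, 4, 8, 16, 64):
    print(n, round(max_throughput(n, 1000, 0.2)))
```

The diminishing returns visible here are exactly why an uncontrolled replication degree wastes resources: past the knee of the curve, each new replica is saturated by update propagation.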
Indeed, the formal model described in [5] for controlling replication freshness performs well in terms of response time and network traffic. Unfortunately, it does not take possible replica failures into account, and communication messages could be further reduced by limiting the number of replicas. The goal of this paper is to limit the overhead involved in managing useless replicas and to bring the following contribution: a replication management solution based on the characteristics of the system. We propose a model that estimates the degree of replication with respect to the effectiveness or volatility of the system's resources in order to ensure data availability. We propose two ways to define the appropriate number of replicas: (i) one based on the required system availability and the frequency of node failures, and (ii) another which takes into account the tolerated staleness of queries and node capabilities in terms of throughput.
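To give intuition for contribution (i), a common textbook-style estimate (a sketch under the assumption of independent node failures, not necessarily the model developed in this paper) derives the replica count from a target availability A and a per-node unavailability probability p: data is reachable as long as at least one replica is up, so we need 1 - p^n >= A, i.e. the smallest n with n >= log(1-A)/log(p):

```python
import math

def replicas_for_availability(target, node_failure_prob):
    """Smallest n such that 1 - p**n >= target, assuming independent
    node failures with unavailability probability p (illustrative)."""
    p = node_failure_prob
    return math.ceil(math.log(1 - target) / math.log(p))

# Example: 99.99% availability with nodes that are down 5% of the time.
print(replicas_for_availability(0.9999, 0.05))  # → 4
```

Note how quickly the required n grows as the target tightens or nodes become more volatile; a model of this kind gives a principled lower bound on the replication degree instead of over-provisioning replicas.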