DEDISbench: A Benchmark for Deduplicated Storage Systems

J. Paulo, P. Reis, J. Pereira, A. Sousa
High-Assurance Software Lab (HASLab)
INESC TEC & University of Minho

Abstract

Deduplication is widely accepted as an effective technique for eliminating duplicated data in backup and archival systems. Nowadays, deduplication is also becoming appealing in cloud computing, where large-scale virtualized storage infrastructures hold huge data volumes with a significant share of duplicated content. There have thus been several proposals for embedding deduplication in storage appliances and file systems, providing different performance trade-offs while targeting both user and application data, as well as virtual machine images.

It is, however, hard to determine to what extent deduplication is useful in a particular setting and which technique will provide the best results. In fact, existing disk I/O micro-benchmarks are not designed for evaluating deduplication systems: they follow simplistic approaches for generating the data to be written, which lead to unrealistic amounts of duplicates.

We address this with DEDISbench, a novel micro-benchmark for evaluating the disk I/O performance of block-based deduplication systems. As the main contribution, we introduce the generation of a realistic duplicate distribution based on real datasets. Moreover, DEDISbench also allows simulating access hotspots and different load intensities for I/O operations. The usefulness of DEDISbench is shown by comparing it with the open-source disk I/O micro-benchmarks Bonnie++ and IOzone in assessing two open-source deduplication systems, Opendedup and Lessfs, using Ext4 as a baseline. As a secondary contribution, our results lead to novel insights into the performance of these file systems.
1 Introduction

Deduplication is now accepted as an effective technique for eliminating duplicated data in backup and archival storage systems [17] and storage appliances [20]. It not only reduces the cost of storage infrastructures but also has a positive performance impact throughout the storage management stack, namely in cache efficiency and network bandwidth consumption [13, 12, 10]. With the cloud