Design and Implementation of Various File Deduplication Schemes on Storage Devices

Yong-Ting Wu, Min-Chieh Yu, Jenq-Shiou Leu
Department of Electronic and Computer Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan
{M10302107, D10002103, jsleu}@mail.ntust.edu.tw

Eau-Chung Lee
QNAP Inc., Taipei, Taiwan
ytlee@qnap.com

Tian Song
Electrical and Electronic Engineering, Graduate School of Engineering, Tokushima University, Tokushima City, Japan
tiansong@ee.tokushima-u.ac.jp

Abstract—As smart devices proliferate, people generate large amounts of data in their daily lives and store it in local or remote file systems. Although modern computer hardware and network technologies can keep up with this volume of data, effective file deduplication can still save storage space in both private computing environments and public cloud systems. In this paper, we design and implement several file deduplication schemes on storage devices, based on different duplicate-checking rules: file name, file size, and the hash value of the full or partial file content. Comprehensive experimental results show that file deduplication based on partial content hashing offers a better trade-off between computation cost and deduplication accuracy.

Keywords—file deduplication; cloud system; storage devices

I. INTRODUCTION

The emerging technical gadgets, such as digital TVs, smartphones, and tablets, have rapidly driven the generation of a large volume of digital data. When digital data are stored in a storage system, duplicates may be created by intentional backups or unintentional copies. Properly removing file redundancy in the storage system reduces the volume of information to manage, which lessens the time and space required for file management. B. Hong, D. Plantenberg, D. D. Long, and M. Sivan-Zimet proposed a file deduplication scheme to improve the storage utilization of storage area networks [1].
D. R. Bobbarjung, S. Jagannathan, and C. Dubnicki then used file partitioning to increase the efficiency of file deduplication [2]. These schemes run on online storage and may not be suitable for storage devices.

Besides, as network applications have been widely developed and deployed, users generate a lot of multimedia data in their daily lives, such as images or video clips captured by digital cameras or smartphone cameras. Users then store them in a remote cloud system or in personal local storage, so the demand for storage, whether on a local disk or in a remote storage farm, increases. In addition, because people may own several heterogeneous smart devices, they are more likely to hold duplicated multimedia data across many storage systems or even within the same storage system, resulting in ineffective storage utilization and inefficient search for a specific file. Carrying out file deduplication on the storage system reduces the space wasted on duplicated files and speeds up file search in the file system.

The most intuitive deduplication strategy is to find files with the same file name or size. However, such a strategy may produce inaccurate results, because unrelated files can share a name or a size. Therefore, a hashing-based file deduplication process is designed to increase the accuracy. However, hashing the full content may incur a high computation cost [3]. A compromise is to hash only part of the content, which responds faster to users at the cost of a small loss in deduplication accuracy [4, 5]. This paper aims to design and implement these file deduplication schemes for space saving. The detailed data structures and process flows of the schemes are also illustrated.
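The checking rules named above (file name, file size, full content hash, partial content hash) can be sketched as interchangeable key functions over a directory tree. This is a minimal illustration under stated assumptions, not the implementation evaluated in this paper: the helper names `content_md5`, `find_duplicates`, and `by_partial_hash` are hypothetical, "partial content" is assumed here to mean the first `sample_bytes` bytes of a file, and the 4096-byte default is an arbitrary choice.

```python
import hashlib
import os
from collections import defaultdict


def content_md5(path, max_bytes=None, chunk_size=65536):
    """Return the MD5 hex digest of a file.

    max_bytes=None hashes the full content; an integer hashes only the
    first max_bytes bytes (an assumed form of partial content hashing).
    """
    digest = hashlib.md5()
    remaining = max_bytes
    with open(path, "rb") as handle:
        while remaining is None or remaining > 0:
            to_read = chunk_size if remaining is None else min(chunk_size, remaining)
            block = handle.read(to_read)
            if not block:
                break  # end of file reached
            digest.update(block)
            if remaining is not None:
                remaining -= len(block)
    return digest.hexdigest()


def find_duplicates(root, key_func):
    """Group files under root by key_func(path); keep groups of 2+ files."""
    groups = defaultdict(list)
    for folder, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            groups[key_func(path)].append(path)
    return {key: sorted(paths) for key, paths in groups.items() if len(paths) > 1}


# The duplicate-checking rules, expressed as key functions.
by_name = os.path.basename
by_size = os.path.getsize
by_full_hash = content_md5


def by_partial_hash(path, sample_bytes=4096):
    # sample_bytes is a hypothetical setting, not a value from the paper.
    return content_md5(path, max_bytes=sample_bytes)
```

Calling `find_duplicates(root, by_partial_hash)` reads at most `sample_bytes` bytes per file, which is where the speed-up comes from. It also groups any files that agree on those leading bytes but differ later, which is the accuracy loss that the partial scheme trades for speed.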
Besides, comprehensive evaluation results are presented to validate the effectiveness of the implemented deduplication schemes. The rest of the paper is organized as follows: Section II presents the data structures and process flows used in the three deduplication schemes. Section III details the experiment environment and the corresponding evaluation results. Finally, a brief conclusion is offered in Section IV.

II. DEDUPLICATION SCHEME IMPLEMENTATION

We design three intuitive approaches to implement file deduplication on storage devices: by the filename, by the size, and by the MD5 (Message-Digest algorithm 5) hash value [6]. The data structures and processing flows used are introduced below.

A. Data Structures

To implement the file deduplication system, we first define the data structures and then use them to carry out the file deduplication procedure.

1) By the filename: This is the most intuitive and easiest of the three deduplication schemes. The user may copy a file into another folder but forget to delete the old one. Hence, the main goal of this approach is to find out and show

QSHINE 2015, August 19-20, Taipei, Taiwan
Copyright © 2015 ICST
DOI 10.4108/eai.19-8-2015.2260903