Design and Implementation of Various File
Deduplication Schemes on Storage Devices
Yong-Ting Wu, Min-Chieh Yu,
Jenq-Shiou Leu
Department of Electronic and Computer
Engineering
National Taiwan University of Science
and Technology
Taipei, Taiwan
{M10302107, D10002103,
jsleu}@mail.ntust.edu.tw
Eau-Chung Lee,
QNAP Inc., Taipei, Taiwan
ytlee@qnap.com
Tian Song
Electrical and Electronic Engineering,
Graduate School of Engineering,
Tokushima University, Tokushima City,
Japan
tiansong@ee.tokushima-u.ac.jp
Abstract—As smart devices proliferate, people generate large
amounts of data in their daily lives and store them in local or
remote file systems. Even though modern computer hardware and
network technologies can cope with this volume of data, effective
file deduplication can save storage space in both private
computing environments and public cloud systems. In this paper,
we design and implement various file deduplication schemes on
storage devices, based on different duplicate-checking rules:
file name, file size, and the hash value of the full or partial
file content. Comprehensive experimental results show that file
deduplication based on partial content hashing offers a better
trade-off between computation cost and deduplication
accuracy.
Keywords—file deduplication; cloud system; storage devices
I. INTRODUCTION
The emerging technical gadgets, such as digital TVs, smartphones,
and tablets, have rapidly driven the growth of digital data. When
digital data are stored in a storage system, duplicated
data may be produced by intended backups or unintended
copies. By properly removing file redundancy in the storage
system, the volume of information to manage is effectively
reduced, significantly lessening the time and space required for
file management. B. Hong, D. Plantenberg, D. D. Long, and M.
Sivan-Zimet proposed their file deduplication scheme to
improve the storage utilization of the storage area network [1].
D. R. Bobbarjung, S. Jagannathan, and C. Dubnicki then used
the concept of file partitioning to increase the efficiency of the
file deduplication scheme [2]. The aforementioned schemes
run on online storage and may not be suitable for
storage devices.
In addition, as network applications have been widely
developed and deployed around the world, users generate
a lot of multimedia data in their daily lives, such as
images or video clips captured by digital cameras or the
cameras built into smartphones. Users then store them in a
remote cloud system or in personal local storage. The demand
for storage, either on the local disk or in a remote storage farm,
hence increases. Moreover, because people may own several
heterogeneous smart devices, they are likely to keep
duplicated multimedia data across many storage systems, or
even within the same storage system, resulting in ineffective
storage utilization and inefficient searches for specific
files. Carrying out file deduplication on the storage system
can reduce the space wasted on duplicated files and speed up
file searches in the file
system.
The most intuitive deduplication strategy is to find files
with the same file name or size. However, such a strategy may
produce inaccurate deduplication results. A hashing-based
file deduplication process is therefore designed to increase the
accuracy. However, hashing the full content of each file
may incur a high computation cost [3]. A compromise is to
hash only part of the content, which may give users a
faster response at the cost of some deduplication
accuracy [4, 5]. This paper describes how to design and
implement various file deduplication schemes for space
saving. The detailed data structures and process
flows for these schemes are also illustrated. In addition,
comprehensive evaluation results are presented to validate the
effectiveness of the implemented deduplication schemes.
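The trade-off between full and partial content hashing can be illustrated with a minimal Python sketch. It is not the implementation evaluated in this paper: the function names, the choice of hashing only the first `sample_bytes` bytes, and the default of 4096 bytes are assumptions made purely for illustration.

```python
import hashlib


def full_md5(path, chunk_size=1 << 20):
    """MD5 of the whole file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def partial_md5(path, sample_bytes=4096):
    """MD5 of only the first sample_bytes bytes of the file.

    Assumption: sampling the file head is one possible partial-content
    rule; other rules (tail, middle, several blocks) are equally valid.
    """
    digest = hashlib.md5()
    with open(path, "rb") as f:
        digest.update(f.read(sample_bytes))
    return digest.hexdigest()
```

Under this assumed rule, `partial_md5` reads a bounded number of bytes regardless of file size, but two files that share the same leading bytes and differ later receive the same partial hash. That false positive is the loss of accuracy traded for speed.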
The rest of the paper is organized as follows: Section II
presents the data structures and process flows used in the three
deduplication schemes. Section III details the experiment
environment and the corresponding evaluation results. Finally,
a brief conclusion is offered in Section IV.
II. DEDUPLICATION SCHEME IMPLEMENTATION
We design three intuitive approaches to implementing
file deduplication on storage devices: by
the filename, by the size, and by the MD5 (Message-Digest
Algorithm 5) hash value [6]. The data structures and
processing flows used are introduced below.
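As a rough illustration of how the three approaches differ only in the key used to compare files, the following Python sketch groups the files under a directory by filename, size, or MD5 hash. The names `find_duplicates`, `KEY_FUNCS`, and `md5_of_file` are hypothetical and do not correspond to the data structures defined in this section.

```python
import hashlib
import os
from collections import defaultdict


def md5_of_file(path, chunk_size=1 << 20):
    """MD5 hex digest of the full file content, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# One key function per scheme (hypothetical names for illustration).
KEY_FUNCS = {
    "filename": os.path.basename,
    "size": os.path.getsize,
    "md5": md5_of_file,
}


def find_duplicates(root, scheme):
    """Return {key: [paths]} for every key shared by two or more files."""
    key_func = KEY_FUNCS[scheme]
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            groups[key_func(path)].append(path)
    return {key: sorted(paths) for key, paths in groups.items() if len(paths) > 1}
```

For example, `find_duplicates("/data", "size")` would report every set of files with the same byte length, which is cheap to compute but can flag unrelated files, whereas the `"md5"` scheme reads every file in full; the path `/data` is a placeholder.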
A. Data Structures
To implement the file deduplication system, we need to
define the data structures first, and then use the data structures
to carry out the file deduplication procedure.
1) By the filename: This is the most intuitive and easiest
approach of the three deduplication schemes. The user may copy
the file into another folder but forget to delete the old one.
Hence, the main goal of this approach is to find out and show
QSHINE 2015, August 19-20, Taipei, Taiwan
Copyright © 2015 ICST
DOI 10.4108/eai.19-8-2015.2260903