International Journal of Science and Research (IJSR), ISSN (Online): 2319-7064, Impact Factor (2012): 3.358, Volume 3 Issue 11, November 2014, www.ijsr.net. Licensed under Creative Commons Attribution CC BY.

A Survey on Secure and Authorized Data Deduplication

Shweta Pochhi 1, Vanita Babanne 2

1, 2 Computer Engineering Department, RMD Sinhgad School of Engineering, Pune University, Pune

Abstract: Data deduplication looks for redundancy in sequences of bytes across very large comparison windows. Sequences of data (over 8 KB long) are compared to the history of other such sequences, which makes the technique ideal for highly redundant operations such as backup, where the same data set is repeatedly copied and stored multiple times for recovery purposes. To protect the confidentiality of sensitive data while supporting deduplication, the convergent encryption technique has been designed to encrypt the data. Convergent encryption enables duplicate files to be coalesced into the space of a single file, even when the files are encrypted with different users' keys. To resist attacks, the notion of proofs of ownership (PoWs) has been introduced, which lets a client efficiently prove to a server that the client actually holds a file.

Keywords: deduplication, convergent encryption, proof of ownership, authorized duplicate check, differential authorization

1. Introduction

Cloud computing provides unlimited "virtualized" resources to users as services across the Internet, while hiding platform and implementation details. Today's cloud service providers (CSPs) offer both highly available storage and massively parallel computing resources at relatively low cost. As cloud computing becomes ubiquitous, an increasing amount of data is being stored in the cloud and shared by users with specified privileges, which define the access rights to the stored data. One significant challenge of cloud storage services is the management of this ever-increasing volume of data.
Cloud computing provides a low-cost, scalable, location-independent infrastructure for data management and storage. The rapid adoption of cloud services is accompanied by increasing volumes of data stored at remote servers; hence techniques for saving disk space and network bandwidth are needed. A central emerging concept in this context is deduplication, where the server stores a single copy of each file, regardless of how many clients asked to store that file. All clients that store the file merely hold links to the single copy of the file stored at the server. Moreover, if the server already has a copy of the file, then clients do not even need to upload it again, thus saving bandwidth as well as storage. In a typical storage system with deduplication, a client first sends the server only a hash of the file, and the server checks whether that hash value already exists in its database. If the hash is not in the database, the server asks for the entire file. Otherwise, since the file already exists at the server (potentially uploaded by someone else), it tells the client that there is no need to send the file itself. Either way, the server marks the client as an owner of that file, and from that point on there is no difference between the client and the original party who uploaded the file. The client can therefore ask to restore the file, regardless of whether he was asked to upload it or not. Data deduplication is a data compression technique for eliminating duplicate copies of repeating data in storage. It is used to improve storage utilization and can also be applied to network data transfers to decrease the number of bytes that must be sent. Deduplication eliminates redundant data by keeping only one physical copy and pointing other redundant data to that copy, instead of keeping multiple copies with the same content. Deduplication can take place at the file level or the block level.
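The hash-first upload protocol described above can be illustrated with a minimal sketch. The `DedupServer` class and `client_upload` function here are hypothetical names invented for this example; a real system would also require a proof of ownership rather than trusting the hash alone.

```python
import hashlib

class DedupServer:
    """Toy server: keeps one physical copy per unique file hash."""
    def __init__(self):
        self.store = {}    # file hash -> file bytes (single physical copy)
        self.owners = {}   # file hash -> set of client ids linked to that copy

    def has_file(self, file_hash):
        return file_hash in self.store

    def register_owner(self, file_hash, client_id):
        self.owners.setdefault(file_hash, set()).add(client_id)

    def upload(self, file_hash, data, client_id):
        self.store[file_hash] = data
        self.register_owner(file_hash, client_id)

def client_upload(server, client_id, data):
    """Send only the hash first; transfer the bytes only if the server lacks them."""
    file_hash = hashlib.sha256(data).hexdigest()
    if server.has_file(file_hash):
        # Duplicate: no data transfer needed, just record the new owner.
        server.register_owner(file_hash, client_id)
        return "deduplicated"
    server.upload(file_hash, data, client_id)
    return "uploaded"
```

With this sketch, the first client to store a file pays the upload cost, while every later client storing identical content is registered as an owner without transferring any bytes, saving both bandwidth and storage exactly as the text describes.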
At the file level it eliminates duplicate copies of the same file, and at the block level it eliminates duplicate blocks of data that occur in non-identical files. Data deduplication has clear benefits. Eliminating redundant data can substantially shrink storage requirements and improve bandwidth efficiency. Even though primary storage has gotten cheaper over time, organizations typically store many versions of the same information so that new workers can reuse previously done work, and operations such as backup store extremely redundant information. Deduplication lowers storage costs because fewer disks are needed, and it improves disaster recovery since there is far less data to transfer. Backup/archive data usually includes a lot of duplicate data: the same data is stored over and over again, consuming unneeded storage space on disk or tape, electricity to power and cool the disk/tape drives, and bandwidth for replication. This creates a chain of cost and resource inefficiencies within the organization. While providing data confidentiality, traditional encryption is incompatible with data deduplication. Specifically, it requires different users to encrypt their data with their own keys; thus identical data copies belonging to different users yield different ciphertexts, making deduplication infeasible. Convergent encryption has been proposed to enforce data confidentiality while making deduplication feasible. It encrypts and decrypts a data copy with a convergent key, which is obtained by computing the cryptographic hash of the content of the data copy. After key generation and data encryption, users retain the keys and send the ciphertext to the cloud. Because the encryption operation is deterministic and the key is derived from the data content, identical data copies generate the same convergent key and hence the same ciphertext.