Chunk and Object Level Deduplication for Web Optimization: A Hybrid Approach

Ioannis Papapanagiotou, Student Member, IEEE, Robert D. Callaway, Member, IEEE, and Michael Devetsikiotis, Fellow, IEEE

Abstract—Proxy caches and Redundancy Elimination (RE) systems have been used to remove redundant bytes on WAN links. However, both come with inherent deficiencies: proxy caches provide less savings than RE systems, and RE systems have limitations related to speed, memory, and storage overhead. In this paper we advocate a hybrid approach, in which each type of cache acts as a module in a system with shared memory and storage space. A static scheduler precedes the cache modules and determines which types of traffic are forwarded to which module. We also propose several optimizations for each module that minimize the storage and memory overhead. We evaluate the proposed system by performing a trace-driven emulation. Our results indicate that a hybrid system provides better savings than either a proxy cache or a standalone RE system. The hybrid system requires less memory and less disk space, and provides a speed-up ratio of three compared to an RE system.

Index Terms—WAN optimization, Traffic deduplication, Redundancy Elimination, Hybrid Cache

I. INTRODUCTION

The exponential growth of mobile data traffic has led service providers to implement data deduplication systems. Data deduplication can remove repetitive patterns in traffic streams and decrease response times for time-sensitive applications. The most widely implemented technique in wired networks has been the proxy cache [1]. Although proxy caches remove redundancy at the object level, a vast amount of web data is uncacheable per RFC 2616 [2]. Some recent studies [3], [4], [5] have advocated the benefits of protocol-independent Redundancy Elimination techniques (also called byte caches).
They can remove redundancy at the chunk level, which is a much finer granularity than the object level. Hence, even if an object or a file is partially modified before being transferred a second time, the unchanged parts still benefit from the optimization. Initially, chunks were identified inside each packet. However, in [6] the authors proposed a WAN optimization system that removes duplicate chunks on top of the TCP layer. TCP-based chunks are bigger than packet-based chunks, but smaller than objects. The advantage of this approach is that it can identify redundant bytes even if they span many packets.

The authors are with the Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC, 27695-7911 USA and with the IBM WebSphere Technology Institute, RTP, NC, USA. Emails: (ipapapa, mdevets)@ncsu.edu and rcallawa@us.ibm.com.

An RE implementation requires the installation of two middleware boxes: one closer to the server (encoder) and one closer to the client (decoder). As data flows from the server to the client, it passes through the boxes and is broken into chunks. The chunks are stored on the persistent storage of each box. For each chunk, a representative fingerprint (hash) that maps to the actual chunk is generated and stored in memory (e.g., a 1KB stream can be represented by a collision-free 20B hash). The two boxes communicate through an out-of-band TCP connection so that the data are delivered in order. Since both boxes contain the same data, they are synchronized. On a second reference to a chunk, the encoding box sends the hash value instead of the actual bytes [3].

In RE systems, fingerprinting is performed based on the Rabin fingerprinting algorithm [7]. A sliding window moves byte by byte, generating the fingerprints. Each fingerprint is compared with a global constant to derive the boundaries of each chunk.
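The boundary-selection step described above can be illustrated with a minimal sketch. It is not the implementation evaluated in this paper: a simple polynomial rolling hash stands in for true Rabin fingerprinting, SHA-1 is assumed only because it yields a 20-byte digest, and the names and constants (`WINDOW`, `BASE`, `MOD`, `MASK`, `MAGIC`, `chunk_boundaries`, `split_chunks`, `fingerprint`) are hypothetical.

```python
import hashlib

# All constants below are illustrative assumptions, not values from the paper.
WINDOW = 16          # sliding-window width in bytes
BASE = 257           # base of the polynomial rolling hash
MOD = (1 << 31) - 1  # modulus that keeps the rolling hash bounded
MASK = (1 << 6) - 1  # test the low 6 bits: a boundary about every 64 bytes
MAGIC = 0            # the "global constant" each window hash is compared with


def chunk_boundaries(data: bytes) -> list[int]:
    """Return chunk end offsets found by sliding a window byte by byte."""
    boundaries = []
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    h = 0
    for i, b in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD  # drop the oldest byte
        h = (h * BASE + b) % MOD                    # take in the newest byte
        if i + 1 >= WINDOW and (h & MASK) == MAGIC:
            boundaries.append(i + 1)                # window content marks a boundary
    if data and (not boundaries or boundaries[-1] != len(data)):
        boundaries.append(len(data))                # close the trailing chunk
    return boundaries


def split_chunks(data: bytes) -> list[bytes]:
    """Cut the stream at the content-defined boundaries."""
    chunks, start = [], 0
    for end in chunk_boundaries(data):
        chunks.append(data[start:end])
        start = end
    return chunks


def fingerprint(chunk: bytes) -> bytes:
    """20-byte digest used as the chunk's index key (SHA-1 is an assumption)."""
    return hashlib.sha1(chunk).digest()
```

Because a boundary depends only on the bytes inside the window, inserting a byte near the start of a stream shifts later boundaries by one position instead of changing them, so the unchanged chunks still match their stored fingerprints.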
However, performing these steps over all of the data may create a bottleneck on higher bandwidth links [8]. Moreover, the hash overhead per object in proxy caches is much smaller than the hash overhead per chunk in RE systems.

In this paper, we propose a hybrid redundancy elimination technique. The proposed system consists of a scheduler, an RE module, and a proxy cache module. The scheduler decides which module each byte stream flows through. The benefits of such an approach can be summarized as follows:

• Fewer hash computations are performed, allowing our hybrid system to be deployed on higher bandwidth links.
• An application layer compression scheme, instead of the standard IP packet-based RE, should lead to better savings and lower storage overhead.
• The memory overhead is reduced compared to a standalone RE system. Byte streams that flow through the proxy cache module do not need to be broken into chunks, so unnecessary hash generation and storage are avoided.
• The proposed approach can be implemented on the encoding box of a WAN optimization system without significant architectural modifications.

The remainder of our paper is organized as follows. In Section II, we describe the proposed system design. In Section III, we briefly describe the emulation environment and the dataset used to validate the suggested implementation. In Section IV, we present results on how effective the proposed system is under various parameters and criteria. Finally, in Section V we conclude with our remarks.
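As an aside to the introduction, the encoder/decoder exchange of a standalone RE system described earlier (the baseline that the hybrid design builds on) can be sketched as follows. This is a simplified illustration under stated assumptions: in-memory dictionaries stand in for the persistent chunk store and the in-memory fingerprint index, SHA-1 is assumed for the 20-byte hash, and the class and message names (`Encoder`, `Decoder`, `"raw"`, `"ref"`, `wire_bytes`) are hypothetical.

```python
import hashlib


def fingerprint(chunk: bytes) -> bytes:
    """20-byte digest identifying a chunk (SHA-1 is an assumption)."""
    return hashlib.sha1(chunk).digest()


class Encoder:
    """Box closer to the server: replaces repeated chunks with fingerprints."""

    def __init__(self):
        self.store = {}  # fingerprint -> chunk (stands in for disk plus index)

    def encode(self, chunks):
        messages = []
        for chunk in chunks:
            fp = fingerprint(chunk)
            if fp in self.store:
                messages.append(("ref", fp))     # seen before: send the hash only
            else:
                self.store[fp] = chunk
                messages.append(("raw", chunk))  # first sighting: send the bytes
        return messages


class Decoder:
    """Box closer to the client: rebuilds the original byte stream."""

    def __init__(self):
        self.store = {}

    def decode(self, messages):
        out = []
        for kind, payload in messages:
            if kind == "raw":
                self.store[fingerprint(payload)] = payload  # stay in sync with encoder
                out.append(payload)
            else:
                out.append(self.store[payload])             # expand hash to bytes
        return b"".join(out)


def wire_bytes(messages) -> int:
    """Payload bytes actually sent between the two boxes."""
    return sum(len(payload) for _, payload in messages)
```

With three 1KB chunks of which the third repeats the first, the encoder sends two raw chunks and one 20-byte reference (2068 bytes instead of 3072). A second transfer of the same chunks costs only 60 bytes, while the fingerprint index grows by 20 bytes for every distinct chunk, which is the per-chunk overhead noted above.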