Universal Data Compression with Side Information at the Decoder by Using Traditional Universal Lossless Compression Algorithms

En-hui Yang
Department of Electrical and Computer Eng.
University of Waterloo
Waterloo, ON N2L 3G1, Canada
Email: ehyang@uwaterloo.ca

Da-ke He
Department of Multimedia Technologies
IBM T. J. Watson Research Center
Yorktown Heights, NY 10598, USA
Email: dakehe@us.ibm.com

Abstract— In this paper we investigate universal data compression with side information at the decoder by leveraging traditional universal data compression algorithms. Specifically, consider a source network with feedback in which a finite alphabet source X = {X_i}_{i=0}^∞ is to be encoded and transmitted, and another finite alphabet source Y = {Y_i}_{i=0}^∞ is available only to the decoder as the side information correlated with X. Assuming that the encoder and decoder share a uniform i.i.d. (independent and identically distributed) random database that is independent of (X, Y), we propose a string matching-based (variable-rate) block coding algorithm with a simple progressive encoder for the feedback source network. Instead of using standard joint typicality decoding, this algorithm derives its decoding rule from the codeword length function of a traditional universal lossless coding algorithm. As a result, neither the encoder nor the decoder assumes any prior knowledge of the joint distribution of (X, Y) or even the achievable rates. It is proven that for any (X, Y) in the class of all stationary, ergodic source-side information pairs with finite alphabet, the average number of bits per letter transmitted from the encoder to the decoder (the compression rate) comes arbitrarily close to the conditional entropy rate H(X|Y) of X given Y asymptotically, and the average number of bits per letter transmitted from the decoder to the encoder (the feedback rate) goes to 0 asymptotically.
I. INTRODUCTION

Consider the communication system shown in Figure 1, where X = {X_i}_{i=0}^∞ denotes the source to be encoded, and Y = {Y_i}_{i=0}^∞ denotes the side information correlated with X and available only at the decoder.¹ Let R_X denote the average compression rate, in bits per letter, resulting from using the encoder in Figure 1 to encode X. From an information-theoretic point of view, we are interested in the minimum R_X at which the decoder can recover X with arbitrarily small error probability.

Fig. 1. A communication system (or a source network) in which the side information Y is available only to the decoder.

¹ Throughout the paper we assume that channels are noiseless unless specified otherwise.

This problem was first considered by Slepian and Wolf in their seminal work [1]. Specifically, it was shown in [1] that for any memoryless pair (X, Y), as long as R_X > H(X|Y), where H(X|Y) denotes the conditional entropy of X given Y, the decoder can recover X with arbitrarily small error probability. The result in [1] was later extended to arbitrary stationary, ergodic sources (X, Y) with countably infinite alphabets independently by Cover [2] and by Ahlswede and Körner [3]. In this paper, the results in [1], [2], and [3] will be collectively referred to as the Slepian-Wolf result for brevity.

It should be noted that in [1], [2], [3], the joint probability distribution of (X, Y) and the joint entropy rate H(X, Y) are assumed known to both the encoder and the decoder. This assumption, however, may not hold in practice. In situations where the joint distribution of (X, Y) is unknown, it is desirable to have data compression algorithms that are asymptotically optimal for a class of sources in the sense that they can get asymptotically arbitrarily close to H(X|Y) for any (X, Y) in the class.
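To make the Slepian-Wolf bound concrete, the following sketch computes H(X|Y) = H(X, Y) - H(Y) for a memoryless pair and compares it with H(X); the joint distribution used here is hypothetical, chosen purely for illustration and not taken from the paper.

```python
import math

# Hypothetical joint pmf p(x, y) for a binary memoryless pair (X, Y);
# the numbers are illustrative only.
p_xy = {
    (0, 0): 0.4, (0, 1): 0.1,
    (1, 0): 0.1, (1, 1): 0.4,
}

def entropy(probs):
    """Shannon entropy in bits of a pmf given as an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Marginal pmfs p(x) and p(y) obtained by summing out the other variable.
p_x, p_y = {}, {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

H_xy = entropy(p_xy.values())   # joint entropy H(X, Y)
H_x = entropy(p_x.values())     # marginal entropy H(X)
H_y = entropy(p_y.values())     # marginal entropy H(Y)
H_x_given_y = H_xy - H_y        # conditional entropy H(X|Y)

print(f"H(X)   = {H_x:.4f} bits")    # 1.0000: X alone needs 1 bit/letter
print(f"H(X|Y) = {H_x_given_y:.4f} bits")  # ~0.7219: fewer bits suffice
```

Because X and Y are correlated, H(X|Y) < H(X): an encoder that could exploit the decoder's side information would need only about 0.72 bits per letter instead of 1, which is exactly the rate saving the Slepian-Wolf result promises.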
Such algorithms are called universal data compression algorithms for the given class of sources.

Universal data compression algorithms that systematically take advantage of the feedback channel in Figure 1 were proposed and analyzed for memoryless sources in [4], [5]. Assume that the encoder and the decoder share a random database which has the same finite dimensional distributions as X and is statistically independent of (X, Y). A string matching-based (variable-rate) block coding algorithm with simple progressive encoding was then proposed for the feedback source network in Figure 1. This algorithm assumes no prior knowledge of the achievable rate at either the encoder or the decoder; instead, it estimates this quantity on the fly during the encoding and decoding process. It was shown in [4], [5] that this algorithm, while having asymptotically zero feedback rates, is universal for the class of all memoryless source-side information pairs