RESEARCH CONTRIBUTIONS

Programming Techniques and Data Structures
Ian Munro, Editor

A Locally Adaptive Data Compression Scheme

JON LOUIS BENTLEY, DANIEL D. SLEATOR, ROBERT E. TARJAN, and VICTOR K. WEI

ABSTRACT: A data compression scheme that exploits locality of reference, such as occurs when words are used frequently over short intervals and then fall into long periods of disuse, is described. The scheme is based on a simple heuristic for self-organizing sequential search and on variable-length encodings of integers. We prove that it never performs much worse than Huffman coding and can perform substantially better; experiments on real files show that its performance is usually quite close to that of Huffman coding. Our scheme has many implementation advantages: it is simple, allows fast encoding and decoding, and requires only one pass over the data to be compressed (static Huffman coding takes two passes).

© 1986 ACM 0001-0782/86/0400-0320 75¢

1. INTRODUCTION

Data compression schemes can be categorized by the unit of data they transmit. Huffman [14] codes are typical of “defined-word” schemes: the context defines sequences of input symbols (which we shall call words) that are transmitted by a variable-length code. At the other extreme, Ziv-Lempel [26] codes transmit variable-length sequences of input symbols, often using a fixed-length code.

In this article we describe a defined-word scheme that uses a technique from another domain that deals with defined words: self-organizing sequential search, in which we wish to maintain a sequential list of words so that frequently accessed words are near the front. Our data compression scheme uses a self-organizing list as an auxiliary data structure, and employs short encodings to transmit words near the front of the list. The scheme never performs much worse than Huffman coding.
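The following minimal sketch illustrates the idea of transmitting list positions instead of words. It rests on assumptions not stated in the text so far: that the self-organizing heuristic is move-to-front, and that a word seen for the first time is sent as the position one past the end of the list, followed by the word itself. The names `mtf_encode` and `mtf_decode` are hypothetical, and the variable-length encoding of the integers is omitted.

```python
def mtf_encode(words):
    """Encode words as 1-based positions in a self-organizing list.

    Assumption: the list is reorganized by move-to-front.
    Assumption: a new word is sent as (len(list) + 1, word).
    """
    lst = []  # most recently used word sits at index 0
    out = []
    for w in words:
        if w in lst:
            out.append((lst.index(w) + 1, None))  # position only
            lst.remove(w)
        else:
            out.append((len(lst) + 1, w))  # position plus the new word
        lst.insert(0, w)  # move (or add) the word to the front
    return out


def mtf_decode(codes):
    """Invert mtf_encode by maintaining the same list as the encoder."""
    lst = []
    words = []
    for pos, w in codes:
        if w is None:
            w = lst.pop(pos - 1)  # known word: look it up by position
        lst.insert(0, w)
        words.append(w)
    return words


message = ["a", "a", "b", "b", "a"]
codes = mtf_encode(message)
# codes == [(1, 'a'), (1, None), (2, 'b'), (1, None), (2, None)]
```

A word that was used recently sits near the front and is sent as a small integer, which a variable-length integer code can represent in few bits. Because the decoder updates an identical list, no table needs to be transmitted in advance.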
If the message to be transmitted exhibits locality of reference (i.e., if the local frequency of words changes dramatically within the message), the scheme performs better than Huffman coding because a word will have a short encoding when it is used frequently and a long encoding when it is used rarely.

Section 2 describes the basic scheme and several dimensions along which it may vary. Mathematical analyses of the performance of the scheme are given in Section 3 and in the Appendix. Experimental evidence is presented in Section 4. Section 5 discusses implementation considerations, and Section 6 contains concluding remarks. A preliminary version of our results appeared as a conference paper [2].

Communications of the ACM, April 1986, Volume 29, Number 4

2. THE THEME AND SOME VARIATIONS

We shall illustrate our scheme by compressing simple “telegraph” messages of words consisting of up-