Unbounded Length Contexts for PPM

JOHN G. CLEARY AND W. J. TEAHAN
Department of Computer Science, University of Waikato, Hamilton, New Zealand
Email: jcleary@cs.waikato.ac.nz, wjt@cs.waikato.ac.nz

The PPM data compression scheme has set the performance standard in lossless compression of text throughout the past decade. PPM is a finite-context statistical modelling technique that can be viewed as blending together several fixed-order context models to predict the next character in the input sequence. This paper gives a brief introduction to PPM, and describes a variant of the algorithm, called PPM*, which exploits contexts of unbounded length. Although requiring considerably greater computational resources (in both time and space), this reliably achieves compression superior to the benchmark PPMC version. Its major contribution is that it shows that the full information available by considering all substrings of the input string can be used effectively to generate high-quality predictions. Hence, it provides a useful tool for exploring the bounds of compression.

Received June 28, 1996; revised July 25, 1997

1. INTRODUCTION

The prediction by partial matching (PPM) data compression scheme has set the performance standard in lossless compression of text throughout the past decade. The original algorithm was first published in 1984 by Cleary and Witten [1], and a series of improvements was described by Moffat, culminating in a careful implementation, called PPMC, which has become the benchmark version [2]. This still achieves results superior to virtually all other compression methods, despite many attempts to better it. Other methods such as those based on Ziv-Lempel coding [3, 4] are more commonly used in practice, but their attractiveness lies in their relative speed rather than any superiority in compression; indeed, their compression performance generally falls distinctly below that of PPM in practical benchmark tests [5].
Prediction by partial matching, or PPM, is a finite-context statistical modelling technique that can be viewed as blending together several fixed-order context models to predict the next character in the input sequence. Prediction probabilities for each context in the model are calculated from frequency counts which are updated adaptively, and the symbol that actually occurs is encoded relative to its predicted distribution using arithmetic coding [6, 7]. The maximum context length is a fixed constant, and it has been found that increasing it beyond about 5 does not generally improve compression [1, 2, 8].

The present paper describes an algorithm, PPM*, which exploits contexts of unbounded length. (A preliminary form of this paper [25] was presented at the 1995 IEEE Data Compression Conference.) It reliably achieves compression superior to the benchmark PPMC version, although our current implementation uses considerably greater computational resources (in both time and space). The next section describes the basic PPM compression scheme. Following that we give our motivation for the use of contexts of unbounded length, introduce the new method and show how it can be implemented using a trie data structure. Then we give some results that demonstrate an improvement of about 6% over the benchmark PPMC. Finally, other seemingly unrelated compression schemes are related to the unbounded-context idea that forms the essential innovation of PPM*.

This paper uses the compression achieved on the standard Calgary text compression corpus [5] as a measure of how good the PPM* model is. The importance of this goes beyond the incremental improvement in the size of the compressed text. Having a computer model that achieves close to human performance is critical in areas such as speech recognition, spell-checking, OCR and language identification.
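The adaptive frequency-count mechanism described above can be illustrated with a short sketch in Python. This is our own simplification, not the authors' implementation: escape probabilities and the arithmetic coder are omitted, and all class and method names are invented for illustration.

```python
from collections import defaultdict

class ContextModel:
    """One fixed-order context model: maps each length-k context to
    adaptively updated successor frequency counts."""

    def __init__(self, order):
        self.order = order
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, history, symbol):
        # The context is the last `order` characters of the history seen so far.
        ctx = history[-self.order:] if self.order > 0 else ""
        self.counts[ctx][symbol] += 1

    def predict(self, history):
        # Return the empirical distribution over next symbols for this context,
        # or {} if the context has never been seen.
        ctx = history[-self.order:] if self.order > 0 else ""
        seen = self.counts.get(ctx)
        if not seen:
            return {}
        total = sum(seen.values())
        return {s: c / total for s, c in seen.items()}

# Train a suite of models of orders 0..2 adaptively, character by character.
models = [ContextModel(k) for k in range(3)]
text = "abracadabra"
for i, ch in enumerate(text):
    for m in models:
        m.update(text[:i], ch)
```

After training, for example, the order-1 model predicts the successors of the context 'a' in proportion to how often each followed 'a' in the input. In full PPM these per-order distributions are blended via the escape mechanism and fed to an arithmetic coder.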
Teahan and Cleary [9] show how the PPM scheme can be used to build a character-based computer model that can predict English text almost as well as humans. They performed experiments on the same text that Claude E. Shannon used in a famous experiment to estimate the entropy of English [10], and found that performance was close to, and in some cases superior to, human-based results. It is also well known in cryptography that removing redundancy is important prior to encryption to prevent statistical attacks [11]. It is important here that there are no models (human or otherwise) that are significantly better than the model used to remove the redundancy.

2. PPM: PREDICTION BY PARTIAL MATCH

The basic idea of PPM is to use the last few characters in the input stream to predict the upcoming one. Models that condition their predictions on a few immediately preceding symbols are called 'finite-context' models of order k, where k is the number of preceding symbols used. PPM employs a suite of fixed-order context models with different values of

THE COMPUTER JOURNAL, Vol. 40, No. 2/3, 1997
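The idea of conditioning on the last k symbols, with a fall-back from longer to shorter contexts, can be sketched as follows. This is our own simplified illustration (function names are invented): PPM proper blends the orders via explicit escape probabilities rather than falling back outright, but the hard fall-back shows the longest-match-first structure.

```python
from collections import Counter, defaultdict

def train(text, max_order):
    """Collect successor counts for every context of length 0..max_order."""
    counts = defaultdict(Counter)
    for i, ch in enumerate(text):
        for k in range(max_order + 1):
            if i >= k:
                # text[i-k:i] is the length-k context preceding position i.
                counts[text[i - k:i]][ch] += 1
    return counts

def predict(counts, history, max_order):
    """Use the longest matching context; back off to shorter orders when the
    current context has never been seen (order 0 always matches)."""
    for k in range(max_order, -1, -1):
        ctx = history[-k:] if k else ""
        if ctx in counts:
            c = counts[ctx]
            total = sum(c.values())
            return k, {s: n / total for s, n in c.items()}
    return -1, {}
```

For instance, after training on "abracadabra" with max_order 2, the history "abr" matches the order-2 context "br", which was always followed by 'a'; an unseen history like "zq" falls back to the order-0 distribution over all characters.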