Information Processing and Management 41 (2005) 1175–1192
www.elsevier.com/locate/infoproman

Word-based text compression using the Burrows–Wheeler transform ☆

Alistair Moffat *, R. Yugo Kartono Isal

Department of Computer Science and Software Engineering, The University of Melbourne, Vic. 3010, Australia

Received 10 December 2003; accepted 11 August 2004
Available online 14 October 2004

☆ This paper combines material presented in preliminary form at the 2001 IEEE Data Compression Conference (by the current authors) and at the 2002 Australasian Computer Science Conference (written in collaboration with a third author, Alwin C.H. Ngai).
* Corresponding author. E-mail address: alistair@cs.mu.oz.au (A. Moffat). URL: http://www.cs.mu.oz.au.
0306-4573/$ - see front matter
doi:10.1016/j.ipm.2004.08.009

Abstract

Block-sorting is an innovative compression mechanism introduced in 1994 by Burrows and Wheeler. It involves three steps: permuting the input one block at a time through the use of the Burrows–Wheeler transform (BWT); applying a move-to-front (MTF) transform to each of the permuted blocks; and then entropy coding the output with a Huffman or arithmetic coder. Until now, block-sorting implementations have assumed that the input message is a sequence of characters. In this paper we extend the block-sorting mechanism to word-based models. We also consider other recency transformations, and are able to show improved compression results compared to MTF and uniform arithmetic coding. For large files of text, the combination of word-based modeling, BWT, and MTF-like transformations allows excellent compression effectiveness to be attained within reasonable resource costs.

© 2004 Elsevier Ltd. All rights reserved.

Keywords: Text compression; Word-based model; Burrows–Wheeler transformation; Recency ranking; Move-to-Front

1. Introduction

Block-sorting is an innovative compression mechanism introduced by Burrows and Wheeler (1994). Block-sorting compression involves three steps: permuting the input one block at a time through the use of the Burrows–Wheeler transform (BWT); applying a recency transformation (usually, but not always, move-to-front, MTF) to each of the permuted blocks; and then entropy coding the output with a