Data compression with long repeated strings

Jon Bentley a,*, Douglas McIlroy b

a Bell Labs, Room 2C-514, 600 Mountain Avenue, Murray Hill, NJ 07974, USA
b Department of Computer Science, Dartmouth College, Hanover, NH 03755, USA

Information Sciences 135 (2001) 1-11
www.elsevier.com/locate/ins

* Corresponding author.
E-mail addresses: jlb@research.bell-labs.com (J. Bentley), doug@cs.dartmouth.edu (D. McIlroy).
0020-0255/01/$ - see front matter © 2001 Published by Elsevier Science Inc.
PII: S0020-0255(01)00097-4

Abstract

Lempel-Ziv schemes compress data by encoding repeated strings that occur in a small sliding window. We propose a scheme that succinctly encodes long strings that appear far apart in the input text. Such long strings are rare in most documents, but occur frequently in data such as large software systems, subroutine libraries, news articles, and other corpora of real documents. Analysis shows that our scheme is computationally efficient, and experiments show that it effectively compresses some classes of input.

1. Introduction

White [11] (see also [3]) proposed compressing text by "replacing [a] repeated string by a reference to [an] earlier occurrence". Ziv and Lempel [12,13] implemented this idea by cleverly representing strings that occur in a relatively small sliding window. We extend the basic idea to represent long repeated strings that may appear far apart in the input text.

On typical English text, our method provides little compression; there are few long repeated strings to be exploited. Some files, though, do contain repeated long strings. Baker [2] documented significant repetition in the code of large software systems. In a mathematical subroutine library, we found many blocks of code repeated across functions of types float, double, complex, and double complex; our method combined with a standard compression