CELEBRATING 40 YEARS OF ICT IN LIBRARIES, MUSEUMS AND ARCHIVES

An algorithm for suffix stripping

M.F. Porter
Computer Laboratory, Cambridge, UK

Abstract
Purpose – The automatic removal of suffixes from words in English is of particular interest in the field of information retrieval. This work was originally published in Program in 1980 and is republished as part of a series of articles commemorating the 40th anniversary of the journal.
Design/methodology/approach – An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL.
Findings – Although simple, it performs slightly better than a much more elaborate system with which it has been compared. It effectively works by treating complex suffixes as compounds made up of simple suffixes, and removing the simple suffixes in a number of steps. In each step the removal of the suffix is made to depend upon the form of the remaining stem, which usually involves a measure of its syllable length.
Originality/value – The piece provides a useful historical document on information retrieval.
Keywords Information retrieval, Computer applications, Historical research
Paper type Technical paper

1. Introduction
Removing suffixes from words by automatic means is an operation which is especially useful in the field of information retrieval. In a typical IR environment, one has a collection of documents, each described by the words in the document title and possibly the words in the document abstract. Ignoring the issue of precisely where the words originate, we can say that a document is represented by a vector of words, or terms. Terms with a common stem will usually have similar meanings, for example:

CONNECT
CONNECTED
CONNECTING
CONNECTION
CONNECTIONS

Frequently, the performance of an IR system will be improved if term groups such as this are conflated into a single term. This may be done by removal of the various suffixes, -ED, -ING, -ION, -IONS, to leave the single stem CONNECT.
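The conflation of the CONNECT group can be illustrated with a minimal Python sketch. This is not the algorithm described in this paper: it is a naive stripper that removes one of the four suffixes named above unconditionally, and the names SUFFIXES, strip_suffix and conflate are hypothetical, introduced only for illustration.

```python
# Illustrative sketch only: a naive suffix stripper, NOT the paper's algorithm.
# All names here (SUFFIXES, strip_suffix, conflate) are assumptions.

# The four suffixes named in the text, ordered so the longest is tried first.
SUFFIXES = ("IONS", "ION", "ING", "ED")


def strip_suffix(word: str) -> str:
    """Remove the first matching suffix, if any, and return the remaining stem."""
    for suffix in SUFFIXES:
        # Require at least one character of stem to remain.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


def conflate(terms):
    """Group terms under the stem produced by strip_suffix."""
    groups = {}
    for term in terms:
        groups.setdefault(strip_suffix(term), []).append(term)
    return groups


terms = ["CONNECT", "CONNECTED", "CONNECTING", "CONNECTION", "CONNECTIONS"]
groups = conflate(terms)
print(groups)  # all five terms are grouped under the single stem CONNECT
```

Unconditional removal of this kind is too crude for general use: strip_suffix("SING") returns "S", for instance. This is the reason for making removal depend on the form of the remaining stem, as outlined in the abstract.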
This paper was first published in Program, Vol. 14 No. 3, July 1980, pp. 130-7. It has been included in this issue as part of a series of articles to commemorate the 40th anniversary of Program. The author is grateful to the British Library R&D Department for the funds which supported this work.

Program: electronic library and information systems, Vol. 40 No. 3, 2006, pp. 211-218. © Emerald Group Publishing Limited, 0033-0337. DOI 10.1108/00330330610681286

In addition, the