CELEBRATING 40 YEARS OF ICT IN LIBRARIES, MUSEUMS AND ARCHIVES

An algorithm for suffix stripping

M.F. Porter
Computer Laboratory, Cambridge, UK

Abstract

Purpose – The automatic removal of suffixes from words in English is of particular interest in the field of information retrieval. This work was originally published in Program in 1980 and is republished as part of a series of articles commemorating the 40th anniversary of the journal.

Design/methodology/approach – An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL.

Findings – Although simple, it performs slightly better than a much more elaborate system with which it has been compared. It effectively works by treating complex suffixes as compounds made up of simple suffixes, and removing the simple suffixes in a number of steps. In each step the removal of the suffix is made to depend upon the form of the remaining stem, which usually involves a measure of its syllable length.

Originality/value – The piece provides a useful historical document on information retrieval.

Keywords Information retrieval, Computer applications, Historical research

Paper type Technical paper

1. Introduction

Removing suffixes from words by automatic means is an operation which is especially useful in the field of information retrieval. In a typical IR environment, one has a collection of documents, each described by the words in the document title and possibly the words in the document abstract. Ignoring the issue of precisely where the words originate, we can say that a document is represented by a vector of words, or terms. Terms with a common stem will usually have similar meanings, for example:

CONNECT
CONNECTED
CONNECTING
CONNECTION
CONNECTIONS

Frequently, the performance of an IR system will be improved if term groups such as this are conflated into a single term. This may be done by removal of the various suffixes, -ED, -ING, -ION, -IONS, to leave the single stem CONNECT.
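The conflation of the CONNECT group can be sketched in a few lines of Python. This is only a naive illustration of unconditional suffix removal, not the algorithm the paper goes on to describe; the names SUFFIXES, strip_suffix and conflate are hypothetical and chosen for this sketch.

```python
# Naive illustration only: NOT the stemming algorithm described in the paper.
# SUFFIXES, strip_suffix and conflate are hypothetical names for this sketch.

# The four suffixes named in the text, longest first so -IONS is tried before -ION.
SUFFIXES = ("IONS", "ING", "ION", "ED")


def strip_suffix(term):
    """Remove the first matching suffix from an upper-case term, if any."""
    for suffix in SUFFIXES:
        # Require at least one character to remain after stripping.
        if term.endswith(suffix) and len(term) > len(suffix):
            return term[: -len(suffix)]
    return term


def conflate(terms):
    """Group terms under the stem left after suffix removal."""
    groups = {}
    for term in terms:
        groups.setdefault(strip_suffix(term), []).append(term)
    return groups


terms = ["CONNECT", "CONNECTED", "CONNECTING", "CONNECTION", "CONNECTIONS"]
print(conflate(terms))  # all five terms end up under the single stem 'CONNECT'
```

A rule this blunt also reduces SING to S, which suggests why the removal of a suffix needs to depend on the form of the remaining stem.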
This paper was first published in Program, Vol. 14 No. 3, July 1980, pp. 130-7. It has been included in this issue as part of a series of articles to commemorate the 40th anniversary of Program. The author is grateful to the British Library R&D Department for the funds which supported this work.

Program: electronic library and information systems, Vol. 40 No. 3, 2006, pp. 211-218. © Emerald Group Publishing Limited, 0033-0337. DOI 10.1108/00330330610681286

In addition, the