Computer Methods and Programs in Biomedicine 62 (2000) 109–113
www.elsevier.com/locate/cmpb

Supporting the classification of pathology reports: comparing two information retrieval methods

L.M. de Bruijn a,*, A. Hasman a, J.W. Arends b

a Department of Medical Informatics, University of Maastricht, Maastricht, The Netherlands
b Department of Pathology, Academic Hospital, Maastricht, Maastricht, The Netherlands

Received 29 June 1999; accepted 4 January 2000

Abstract

In this contribution two methods from the domain of information retrieval are compared. The goal of the retrieval is to select from a library of pathology reports those that are most similar to a given report. The SNOMED codes that accompany these reports are presented to the pathologist who has to code the given report, with the aim of improving the quality of coding. The reports were represented either as a vector of words or as a vector of N-grams; 4-, 5- and 6-grams were used. The similarity of the reports was determined by comparing the SNOMED terms that were added to the reports. It could be concluded that the word-based method was consistently better than the N-gram method. © 2000 Elsevier Science Ireland Ltd. All rights reserved.

Keywords: N-gram; Information theory; SNOMED; Coding

1. Introduction

Clinical pathology is a diagnostic service: at the pathology department, tissue or cell samples are examined on the order of an attending physician. The result of the examination is reported and included in the patient record. Apart from being described in a report, the diagnostic findings are also summarised in the so-called diagnosis line following the report, using terminology from a restricted vocabulary. In the Netherlands, excerpts of all examinations (including the diagnosis line) are sent to the PALGA foundation (PALGA stands for Dutch Network and National Database for Pathology), where they are stored in a database after several checks on the data have been performed.
Pathologists founded the PALGA network in 1971. Nationwide coverage of 100% (70 laboratories) was reached around 1990. In 1996 the database contained about 20 000 000 excerpts, with an annual increase of about 2 000 000. The pathologists have adopted a version of SNOMED (Systematized Nomenclature of Medicine [1]) to represent their findings in a formal way. Since SNOMED is a multi-axial coding system [2], the pathologists have to record terms for all appropriate axes. At a minimum, they should code the topology axis.

* Corresponding author.
0169-2607/00/$ - see front matter © 2000 Elsevier Science Ireland Ltd. All rights reserved. PII: S0169-2607(00)00056-0

When using the