Computer Methods and Programs in Biomedicine 62 (2000) 109 – 113
Supporting the classification of pathology reports:
comparing two information retrieval methods
L.M. de Bruijn a,*, A. Hasman a, J.W. Arends b
a Department of Medical Informatics, University of Maastricht, Maastricht, The Netherlands
b Department of Pathology, Academic Hospital, Maastricht, Maastricht, The Netherlands
Received 29 June 1999; accepted 4 January 2000
Abstract
In this contribution two methods from the domain of information retrieval are compared. The goal of the retrieval
is to select from a library of pathology reports those that are most similar to a given report. The SNOMED
codes that accompany these reports are presented to the pathologist who has to code the given report, with the aim
of improving the quality of coding. The reports were represented either as a vector of words or as a vector of N-grams;
4-, 5- and 6-grams were used. The similarity of the reports was determined by comparing the SNOMED terms
that were added to the reports. The word-based method was consistently better than the
N-gram method. © 2000 Elsevier Science Ireland Ltd. All rights reserved.
Keywords: N-gram; Information theory; SNOMED; Coding
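The two report representations named in the abstract can be illustrated with a minimal sketch. The abstract does not state how the vectors are weighted or compared, so the raw counts and the cosine similarity used here are assumptions, and the function names (word_vector, ngram_vector, cosine, most_similar) are hypothetical.

```python
from collections import Counter
from math import sqrt


def word_vector(text):
    """Bag-of-words counts over lowercased, whitespace-separated tokens."""
    return Counter(text.lower().split())


def ngram_vector(text, n):
    """Counts of overlapping character n-grams (e.g. n = 4, 5 or 6)."""
    s = " ".join(text.lower().split())  # normalise whitespace
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (assumed measure)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def most_similar(query, library, vectorize, top_k=3):
    """Return (score, index) pairs for the top_k library reports."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(doc)), idx) for idx, doc in enumerate(library)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_k]


# Toy library of invented report fragments, for illustration only.
library = [
    "adenocarcinoma of the colon",
    "normal skin biopsy",
    "colon polyp with adenoma",
]
query = "colon adenocarcinoma"
by_words = most_similar(query, library, word_vector)
by_4grams = most_similar(query, library, lambda t: ngram_vector(t, 4))
```

In a setting like the one described, the SNOMED codes attached to the top-ranked library reports would then be shown to the pathologist as coding suggestions.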
1. Introduction
Clinical pathology is a diagnostic service: at
the pathology department, tissue or cell samples
are examined at the request of an attending physician.
The result of the examination is reported and
included in the patient record. Apart from being
described in a report the diagnostic findings are
also summarised in the so-called diagnosis line
following the report, using terminology from a
restricted vocabulary. In the Netherlands, excerpts
of all examinations (including the diagnosis line)
are sent to the PALGA foundation (PALGA
stands for Dutch Network and National Database
for Pathology), where they are stored in a
database after several checks on the data have
been performed. Pathologists founded the
PALGA network in 1971. Nationwide coverage
of 100% (70 laboratories) was reached around
1990. In 1996 the database contained about
20 000 000 excerpts, with an annual increase of
about 2 000 000.
The pathologists have adopted a version of
SNOMED (Systematized Nomenclature of
Medicine [1]) to represent their findings in a
formal way. Since SNOMED is a multi-axial coding
system [2], the pathologists have to record
terms for all appropriate axes. At a minimum, they
should code the topography axis. When using the
* Corresponding author.
PII:S0169-2607(00)00056-0