George Hripcsak, MD, MS
John H. M. Austin, MD
Philip O. Alderson, MD
Carol Friedman, PhD
Index terms:
Computers
Picture archiving and communication system (PACS)
Quality assurance
Thorax, radiography, 60.1215
Published online before print
10.1148/radiol.2241011118
Radiology 2002; 224:157–163
Abbreviation:
ROC = receiver operating characteristic
1
From the Departments of Medical Informatics (G.H., C.F.) and Radiology (J.H.M.A., P.O.A.), Columbia University, 622 W 168th St, VC-5, New York, NY 10032; and Department of Computer Science, Queens College, City University of New York (C.F.). Received June 29, 2001; revision requested July 27; revision received September 27; accepted November 12. Supported by National Library of Medicine grants R01-LM06910, R01-LM06274, and R29-LM05627. Address correspondence to G.H. (e-mail: hripcsak@columbia.edu).
C.F. is named in a patent held by Columbia University for the natural language processor described in this report.
© RSNA, 2002
Author contributions:
Guarantor of integrity of entire study, G.H.; study concepts and design, all authors; literature research, all authors; clinical studies, G.H.; data acquisition, G.H., C.F.; data analysis/interpretation, all authors; statistical analysis, G.H.; manuscript preparation, G.H.; manuscript definition of intellectual content, all authors; manuscript editing, G.H.; manuscript revision/review and final version approval, all authors.
Use of Natural Language Processing to Translate Clinical Information from a Database of 889,921 Chest Radiographic Reports
1
PURPOSE: To evaluate translation of chest radiographic reports by using natural
language processing and to compare the findings with those in the literature.
MATERIALS AND METHODS: A natural language processor coded 10 years of narrative chest radiographic reports from an urban academic medical center. Coding for 150 reports was compared with manual coding. Frequencies and co-occurrences of 24 clinical conditions (diseases, abnormalities, and clinical states) were estimated. The ratio of right to left lung mass, association of pleural effusion with other conditions, and frequency of bullet and stab wounds were compared with independent observations. The sensitivity and specificity of the system’s pneumothorax coding were compared with those of manual financial coding.
RESULTS: The system coded 889,921 reports on 251,186 patients. On the basis of manual coding of 150 reports, the processor’s sensitivity (0.81) and specificity (0.99) were comparable to those previously reported for natural language processing and for expert coders. The frequencies of the selected conditions ranged from 0.22 for pleural effusion to 0.0004 for tension pneumothorax. The database confirmed earlier observations that lung cancer occurs in a 3:2 right-to-left ratio. The association of pleural effusion with other conditions mirrored that in the literature. Bullet and stab wounds decreased during 10 years at a rate consistent with crime statistics. A review of pneumothorax cases showed that the database (sensitivity, 1.00; specificity, 0.996) was more accurate than financial discharge coding (sensitivity, 0.17; P = .002; specificity, 0.996; not significant).
CONCLUSION: Internal and external validation in this study confirmed the accuracy
of natural language processing for translating chest radiographic narrative reports
into a large database of information.
Attempts to create electronic clinical databases have been limited by a lack of accurate
information (1). Administrative databases, although they can be huge, lack clinical truth
(2,3). Medical practice generates a large amount of clinical data in narrative form (notes, summaries, and test reports), but its lack of standardized structure hinders its use for aggregate analysis or for real-time automated systems (4).
Natural language processing (5–11) offers a solution. It converts machine-readable
narrative text into a structured form. For example, a natural language processor might
code this excerpt from a radiographic report, “Improved patchy opacity in the left lower
lobe, no effusions seen,” as follows: Finding, opacity; descriptor, patchy; body location,
left lower lobe of lung; change, better; finding, pleural effusion; certainty, no. This
structured format allows the data to be used for clinical research (generating and testing hypotheses with large samples and screening patients for studies on a large scale) and for clinical care by means of automatically generated alerts and reminders.
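As an illustration only, the coded excerpt quoted above might be held in a program roughly as follows. This is a minimal sketch: the class name, field names, the default certainty of "yes", and the helper `is_present` are all hypothetical and are not the processor's actual output format. The coding is written by hand from the example in the text, not produced by any language processor.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Finding:
    """One coded finding. Field names are illustrative assumptions."""
    finding: str
    certainty: str = "yes"  # assumed default; "no" marks a negated finding
    descriptor: Optional[str] = None
    body_location: Optional[str] = None
    change: Optional[str] = None


# Hand-written coding of the excerpt:
# "Improved patchy opacity in the left lower lobe, no effusions seen"
coded_report: List[Finding] = [
    Finding(
        finding="opacity",
        descriptor="patchy",
        body_location="left lower lobe of lung",
        change="better",
    ),
    Finding(finding="pleural effusion", certainty="no"),
]


def is_present(findings: List[Finding], name: str) -> bool:
    """Return True if `name` is coded at least once without negation."""
    return any(f.finding == name and f.certainty != "no" for f in findings)


print(is_present(coded_report, "opacity"))           # True
print(is_present(coded_report, "pleural effusion"))  # False
```

Keeping certainty as an explicit field is what would let a query for pleural effusion skip this report, whereas a plain keyword search for "effusion" in the narrative text would match it.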
Computer Applications