Accident Analysis and Prevention 36 (2004) 165–171
Computerized coding of injury narrative data from the
National Health Interview Survey
Helen M. Wellman a,*, Mark R. Lehto b, Gary S. Sorock a, Gordon S. Smith a
a Liberty Mutual Research Institute for Safety, 71 Frankland Road, Hopkinton, MA 01748, USA
b School of Industrial Engineering, Purdue University, 1287 Grissom Hall, West Lafayette, IN 47907-1287, USA
Received 15 June 2002; received in revised form 8 November 2002; accepted 18 November 2002
Abstract
Objective: To investigate the accuracy of a computerized method for classifying injury narratives into external-cause-of-injury and
poisoning (E-code) categories.
Methods: This study used injury narratives and corresponding E-codes assigned by experts from the 1997 and 1998 US National Health
Interview Survey (NHIS). A Fuzzy Bayesian model was used to assign injury descriptions to 13 E-code categories. Sensitivity, specificity
and positive predictive value were measured by comparing the computer-generated codes with E-code categories assigned by experts.
Results: The computer program correctly classified 4695 (82.7%) of the 5677 injury narratives when multiple words were included as
keywords in the model. The use of multiple-word predictors compared with using single words alone improved both the sensitivity and
specificity of the computer-generated codes. The program is capable of identifying and filtering out cases that would benefit most from
manual coding. For example, the program could be used to code the narrative if the maximum probability of a category given the keywords
in the narrative was at least 0.9. If the maximum probability was lower than 0.9 (which will be the case for approximately 33% of the
narratives) the case would be filtered out for manual review.
Conclusions: A computer program based on Fuzzy Bayes logic is capable of accurately assigning cause-of-injury codes to injury
narratives. The capacity to filter out certain cases for manual coding improves the utility of this process.
© 2003 Elsevier Science Ltd. All rights reserved.
Keywords: Injury; Narrative text; E-code; Fuzzy Bayes
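The filtering rule and the accuracy measures described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' program: it assumes that per-category probabilities for a narrative have already been produced by some model, and the function names (route, one_vs_rest_metrics) and category labels ("fall", "struck") are hypothetical. Only the 0.9 cutoff and the three measures (sensitivity, specificity, positive predictive value) come from the text.

```python
from typing import Dict, List, Tuple

THRESHOLD = 0.9  # cutoff quoted in the abstract; everything else here is illustrative


def route(probs: Dict[str, float], threshold: float = THRESHOLD) -> Tuple[str, str]:
    """Pick the most probable category and decide how the case is handled.

    probs maps a category label to its probability given the narrative's
    keywords (assumed to be computed elsewhere). Returns (category, action),
    where action is "auto" if the maximum probability is at least the
    threshold and "manual" otherwise.
    """
    category = max(probs, key=probs.get)
    action = "auto" if probs[category] >= threshold else "manual"
    return category, action


def one_vs_rest_metrics(expert: List[str], computer: List[str],
                        category: str) -> Dict[str, float]:
    """Sensitivity, specificity and PPV for one category, treating the
    expert-assigned codes as the reference standard."""
    tp = fp = fn = tn = 0
    for e, c in zip(expert, computer):
        if c == category and e == category:
            tp += 1  # computer and expert both assign the category
        elif c == category:
            fp += 1  # computer assigns it, expert does not
        elif e == category:
            fn += 1  # expert assigns it, computer does not
        else:
            tn += 1  # neither assigns it

    def ratio(num: int, den: int) -> float:
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
    }
```

Under this sketch, a call such as route({"fall": 0.95, "struck": 0.05}) would mark the case for automatic coding, whereas a narrative whose largest probability falls below 0.9 would be set aside for manual review.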
1. Introduction
Analysis of the circumstances surrounding an
injury-producing event is essential for determining injury
mechanisms and guiding prevention efforts. Central to this effort
is the assignment of meaningful cause-of-injury codes for
data analysis and comparison. A widely used system for
coding causes of injury is the external-cause-of-injury and
poisoning codes (E-codes) of the World Health Organization’s
(WHO’s) International Classification of Diseases (ICD-9,
World Health Organization, 1977). Although ICD-9 cause
coding has limitations in the specificity of its codes (Sorock
et al., 1993), it provides a useful means of standardizing
external causes across different data sources (Williamson
et al., 2001).
* Corresponding author. Tel.: +1-508-435-9061x206; fax: +1-508-435-8136.
E-mail address: helen.wellman@libertymutual.com (H.M. Wellman).
Since 1957, the National Center for Health Statistics
(NCHS) has conducted the National Health Interview
Survey (NHIS), which collects annual health survey data on
the US population. In 1997, the survey was redesigned
to include detailed questions about injuries including free
text narratives of the circumstances surrounding the injury
event (Warner et al., 2000). Trained coders hired by the
NCHS (experts) code this information into ICD-9 E-code
categories. Branching text questions were also added to ob-
tain more specific information about the circumstances for
certain injuries (e.g. motor vehicle crashes, gunshots, falls,
burns, and drownings).
The addition of narrative text information in electronic
format to injury databases can be a useful adjunct to epi-
demiological analysis and provide valuable information
(Sorock et al., 1997; Smith, 2001). Accident descriptions
can be used to identify and prioritize prevention efforts.
Grouping the data (or coding) is an essential part of the
analytic process. However, manual coding, especially on large
datasets, can be burdensome and consume valuable resources.
Several papers have evaluated the benefits of coding and an-
alyzing narrative text using computer algorithms (Buckely
et al., 1993; Lehto and Sorock, 1996; Sorock et al., 1996,
0001-4575/$ – see front matter © 2003 Elsevier Science Ltd. All rights reserved.
doi:10.1016/S0001-4575(02)00146-X