Accident Analysis and Prevention 36 (2004) 165–171

Computerized coding of injury narrative data from the National Health Interview Survey

Helen M. Wellman a,*, Mark R. Lehto b, Gary S. Sorock a, Gordon S. Smith a

a Liberty Mutual Research Institute for Safety, 71 Frankland Road, Hopkinton, MA 01748, USA
b School of Industrial Engineering, Purdue University, 1287 Grissom Hall, West Lafayette, IN 47907-1287, USA

Received 15 June 2002; received in revised form 8 November 2002; accepted 18 November 2002

Abstract

Objective: To investigate the accuracy of a computerized method for classifying injury narratives into external-cause-of-injury and poisoning (E-code) categories.

Methods: This study used injury narratives and corresponding E-codes assigned by experts from the 1997 and 1998 US National Health Interview Survey (NHIS). A Fuzzy Bayesian model was used to assign injury descriptions to 13 E-code categories. Sensitivity, specificity and positive predictive value were measured by comparing the computer generated codes with E-code categories assigned by experts.

Results: The computer program correctly classified 4695 (82.7%) of the 5677 injury narratives when multiple words were included as keywords in the model. The use of multiple-word predictors compared with using single words alone improved both the sensitivity and specificity of the computer generated codes. The program is capable of identifying and filtering out cases that would benefit most from manual coding. For example, the program could be used to code the narrative if the maximum probability of a category given the keywords in the narrative was at least 0.9. If the maximum probability was lower than 0.9 (which will be the case for approximately 33% of the narratives) the case would be filtered out for manual review.

Conclusions: A computer program based on Fuzzy Bayes logic is capable of accurately categorizing cause-of-injury codes from injury narratives.
The capacity to filter out certain cases for manual coding improves the utility of this process.

© 2003 Elsevier Science Ltd. All rights reserved.

Keywords: Injury; Narrative text; E-code; Fuzzy Bayes

* Corresponding author. Tel.: +1-508-435-9061x206; fax: +1-508-435-8136. E-mail address: helen.wellman@libertymutual.com (H.M. Wellman).

1. Introduction

Analysis of the circumstances surrounding an injury-producing event is essential for determining injury mechanisms and guiding prevention efforts. Central to this effort is the assignment of meaningful cause-of-injury codes for data analysis and comparison. A widely used system for coding causes of injury is the set of external-cause-of-injury and poisoning codes (E-codes) of the World Health Organization’s (WHO’s) International Classification of Diseases (ICD-9; World Health Organization, 1977). Although ICD-9 cause coding has limitations in the specificity of its codes (Sorock et al., 1993), it provides a useful means of standardizing external causes across different data sources (Williamson et al., 2001).

Since 1957, the National Center for Health Statistics (NCHS) has conducted the National Health Interview Survey (NHIS), which collects annual health survey data on the US population. In 1997, the survey was redesigned to include detailed questions about injuries, including free-text narratives of the circumstances surrounding the injury event (Warner et al., 2000). Trained coders hired by the NCHS (experts) code this information into ICD-9 E-code categories. Branching text questions were also added to obtain more specific information about the circumstances for certain injuries (e.g. motor vehicle crashes, gunshots, falls, burns, and drownings).

The addition of narrative text information in electronic format to injury databases can be a useful adjunct to epidemiological analysis and provide valuable information (Sorock et al., 1997; Smith, 2001).
Accident descriptions can be used to identify and prioritize prevention efforts. Grouping the data (or coding) is an essential part of the analytic process. However, manual coding, especially on large datasets, can be burdensome and use up valuable resources. Several papers have evaluated the benefits of coding and analyzing narrative text using computer algorithms (Buckely et al., 1993; Lehto and Sorock, 1996; Sorock et al., 1996,

0001-4575/$ – see front matter. doi:10.1016/S0001-4575(02)00146-X