George Hripcsak, MD, MS
John H. M. Austin, MD
Philip O. Alderson, MD
Carol Friedman, PhD

Index terms:
Computers
Picture archiving and communication system (PACS)
Quality assurance
Thorax, radiography, 60.1215

Published online before print 10.1148/radiol.2241011118
Radiology 2002; 224:157–163

Abbreviation: ROC = receiver operating characteristic

1 From the Departments of Medical Informatics (G.H., C.F.) and Radiology (J.H.M.A., P.O.A.), Columbia University, 622 W 168th St, VC-5, New York, NY 10032; and Department of Computer Science, Queens College, City University of New York (C.F.). Received June 29, 2001; revision requested July 27; revision received September 27; accepted November 12. Supported by National Library of Medicine grants R01-LM06910, R01-LM06274, and R29-LM05627. Address correspondence to G.H. (e-mail: hripcsak@columbia.edu).

C.F. is named in a patent held by Columbia University for the natural language processor described in this report.

© RSNA, 2002

Author contributions: Guarantor of integrity of entire study, G.H.; study concepts and design, all authors; literature research, all authors; clinical studies, G.H.; data acquisition, G.H., C.F.; data analysis/interpretation, all authors; statistical analysis, G.H.; manuscript preparation, G.H.; manuscript definition of intellectual content, all authors; manuscript editing, G.H.; manuscript revision/review and final version approval, all authors.

Use of Natural Language Processing to Translate Clinical Information from a Database of 889,921 Chest Radiographic Reports 1

PURPOSE: To evaluate translation of chest radiographic reports by using natural language processing and to compare the findings with those in the literature.

MATERIALS AND METHODS: A natural language processor coded 10 years of narrative chest radiographic reports from an urban academic medical center. Coding for 150 reports was compared with manual coding.
Frequencies and co-occurrences of 24 clinical conditions (diseases, abnormalities, and clinical states) were estimated. The ratio of right to left lung mass, association of pleural effusion with other conditions, and frequency of bullet and stab wounds were compared with independent observations. The sensitivity and specificity of the system's pneumothorax coding were compared with those of manual financial coding.

RESULTS: The system coded 889,921 reports on 251,186 patients. On the basis of manual coding of 150 reports, the processor's sensitivity (0.81) and specificity (0.99) were comparable to those previously reported for natural language processing and for expert coders. The frequencies of the selected conditions ranged from 0.22 for pleural effusion to 0.0004 for tension pneumothorax. The database confirmed earlier observations that lung cancer occurs in a 3:2 right-to-left ratio. The association of pleural effusion with other conditions mirrored that in the literature. Bullet and stab wounds decreased during 10 years at a rate consistent with crime statistics. A review of pneumothorax cases showed that the database (sensitivity, 1.00; specificity, 0.996) was more accurate than financial discharge coding (sensitivity, 0.17; P = .002; specificity, 0.996; not significant).

CONCLUSION: Internal and external validation in this study confirmed the accuracy of natural language processing for translating chest radiographic narrative reports into a large database of information.

Attempts to create electronic clinical databases have been limited by a lack of accurate information (1). Administrative databases, although they can be huge, lack clinical truth (2,3). Medical practice generates a large amount of clinical data in narrative form (notes, summaries, and test reports), but its lack of standardized structure hinders its use for aggregate analysis or for real-time automated systems (4). Natural language processing (5–11) offers a solution.
It converts machine-readable narrative text into a structured form. For example, a natural language processor might code this excerpt from a radiographic report, "Improved patchy opacity in the left lower lobe, no effusions seen," as follows: Finding, opacity; descriptor, patchy; body location, left lower lobe of lung; change, better; finding, pleural effusion; certainty, no. This structured format allows the data to be used for clinical research (generating and testing hypotheses with large samples and screening patients for studies on a large scale) and for clinical care by means of automatically generated alerts and reminders.
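
To make the structured form concrete, the following minimal Python sketch hand-encodes the coded excerpt above and shows how a downstream screen might query it. The `Finding` class, its field names, and the two helper functions are hypothetical illustrations chosen for this example; they are not the data model or output format of the processor described in this report, and the sketch performs no language processing itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Finding:
    """One coded finding with its modifiers (hypothetical structure)."""
    name: str
    modifiers: Dict[str, str] = field(default_factory=dict)

    def is_asserted(self) -> bool:
        # Assumption: a finding counts as present unless its
        # certainty modifier explicitly negates it.
        return self.modifiers.get("certainty") != "no"


# Hand-written encoding of the excerpt:
# "Improved patchy opacity in the left lower lobe, no effusions seen"
report: List[Finding] = [
    Finding("opacity", {
        "descriptor": "patchy",
        "body location": "left lower lobe of lung",
        "change": "better",
    }),
    Finding("pleural effusion", {"certainty": "no"}),
]


def asserted_findings(findings: List[Finding]) -> List[str]:
    """Names of findings the report asserts as present."""
    return [f.name for f in findings if f.is_asserted()]


def has_condition(findings: List[Finding], name: str) -> bool:
    """True if the named condition is coded and not negated."""
    return any(f.name == name and f.is_asserted() for f in findings)


if __name__ == "__main__":
    print(asserted_findings(report))
    print(has_condition(report, "pleural effusion"))
```

Keeping the negation as a modifier on the finding, rather than dropping negated findings, lets a query distinguish "effusion explicitly ruled out" from "effusion not mentioned", which matters when the same records feed both research screening and automated alerts.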