Natural language understanding in road accident data analysis

J. Wu & B. G. Heydecker*

Centre for Transport Studies, University College London, Gower Street, London WC1E 6BT, UK

(Received 2 November 1995; accepted 21 November 1997)

Road accident records in Britain each comprise two components: coded data in a predefined format, and plain English in free format. This paper describes a natural language understanding system for information retrieval from the latter to verify and extend the former. We adopt the description logic system BACK to achieve a common representation of information from each of the two sources to facilitate comparison. A sub-category grammar is adapted to achieve automatic classification in BACK, and a bidirectional chart parser is adapted to operate with this grammar. This gives good independence between grammar rules, and provides flexibility, expressiveness, and the ability to resolve ambiguities. © 1998 Elsevier Science Ltd and Civil-Comp Ltd. All rights reserved.

Key words: natural language understanding, description logic, road accident analysis.

1 INTRODUCTION

1.1 Requirements of information retrieval from the English text

In Britain, road accidents involving personal injury are recorded as computer-readable ASCII files in a standard format (known as STATS 19) that comprises two components: numerically coded information in a large number of fields, which provides details of a range of aspects of the accident, and plain English text in two fields, which describe respectively the location of the accident and the events associated with it. This data set is one of the main sources of the information required for a range of road accident analyses.1,2 A number of limitations in the coded information have been highlighted in earlier work,1–3 including ambiguity in locational coding, absence of pertinent information, and inconsistencies both within and between accidents.
Overcoming these difficulties currently requires intensive manual checking and validation of accident groups; in this process, use is normally made of the plain English text, as it is generally considered to be more reliable than the coded data.

The objective of the research presented here is to investigate the application of computer-based natural language understanding (NLU) techniques to the automatic extraction of information from the plain English description of accident location and occurrence in STATS 19 accident records. Information extracted in this way can be used to support, validate and augment the coded data with minimum human intervention. Furthermore, the plain English text can be used to overcome limitations which inevitably arise in rigid coding systems: relevant features of sites and accidents can be recorded in the plain text whether or not their occurrence has been anticipated. Automating some of the more routine aspects of road accident data analysis will enable its users to spend their time more effectively on the subsequent, more skilled parts of their investigations.

1.2 Characteristics of the application and the system

The linguistic style of the plain English text to be processed is unusual in that it uses only a small subset of the language, with a fairly restricted lexicon and grammar; such a subset is known as a sublanguage. It is also characterized by extensive use of abbreviations, specialized jargon, and deviant grammatical forms (missing words, lack of punctuation, etc.). The following is a typical example of the English descriptions appearing in road accident records.

*Author to whom all correspondence should be addressed.

Advances in Engineering Software Vol. 29, No. 7–9, pp. 599–610, 1998. Printed in Great Britain. 0965-9978/98/$19.00 + 0.00. PII: S0965-9978(98)00025-8