International Journal of Medical Informatics 53 (1999) 1–28

Discourse structures in medical reports — Watch out! The generation of referentially coherent and valid text knowledge bases in the MEDSYNDIKATE system

Udo Hahn a,*, Martin Romacker a,b, Stefan Schulz a,b

a Freiburg University, Computational Linguistics Lab, Werthmannplatz 1, D-79085 Freiburg, Germany
b Department of Medical Informatics, Freiburg University Hospital, Stefan-Meier-Str. 26, D-79104 Freiburg, Germany

Received 15 February 1998; received in revised form 20 March 1998; accepted 25 March 1998

* Corresponding author. Tel.: +49 761 2033255; fax: +49 761 2033251; e-mail: hahn@coling.uni-freiburg.de

1386-5056/99/$ - see front matter © 1999 Elsevier Science Ireland Ltd. All rights reserved. PII S1386-5056(98)00091-4

Abstract

The automatic analysis of medical narratives currently suffers from neglecting text structure phenomena such as referential relations between discourse units. This has unwarranted effects on the descriptional adequacy of medical knowledge bases automatically generated from texts. The resulting representation bias can be characterized in terms of incomplete, artificially fragmented and referentially invalid knowledge structures. We focus here on four basic types of textual reference relations, viz. pronominal and nominal anaphora, textual ellipsis, and metonymy, and show how to deal with them in an adequate text parsing device. Since the types of reference relations we discuss show an increasing dependence on conceptual background knowledge, we stress the need for formally grounded, expressive conceptual representation systems for medical knowledge. Our suggestions are based on experience with MEDSYNDIKATE, a medical text knowledge acquisition system designed to properly deal with various sorts of discourse structure phenomena. © 1999 Elsevier Science Ireland Ltd. All rights reserved.

Keywords: Natural language processing: text understanding; Knowledge acquisition from texts; Knowledge representation: description logics; Ontology and terminology: pathology domain

1. Introduction

With the overall diffusion of electronic text processing technology in clinical offices and at the physician's workplace and, more recently, the unlimited access to text resources on the Internet, a vast potential for medical information supply arises. The natural language processing community, therefore, faces the challenge of meeting the requirements of cursory as well as in-depth analysis of large