Information Processing & Management Vol. 26, No. 1, pp. 135-170, 1990. 0306-4573/90 $3.00 + .00. Printed in Great Britain. Copyright © 1990 Pergamon Press plc

TOPIC PARSING: ACCOUNTING FOR TEXT MACRO STRUCTURES IN FULL-TEXT ANALYSIS

UDO HAHN

Fakultät für Mathematik und Informatik, Universität Passau, Postfach 2540, D-8390 Passau, F.R. Germany

(Received 21 March 1989; accepted in final form 30 August 1989)

Abstract - The rapid proliferation of full-text databases poses serious problems to the natural language processing components of information retrieval systems. Not taking text-level phenomena of written natural language discourse into account causes a marked decrease of performance for many text information system applications. Consequently, appropriate text parsing facilities must be capable of recognizing the rich internal structure of full-texts on lower levels of text connectivity as well as on the global organizational level of text coherence. This paper introduces such a parser which is based on the conceptual knowledge of its domain and is organized as a collection of distributed lexicalized grammar modules (word experts) which communicate through message-passing. Emphasis is put on text grammatical specifications which state formal conditions for recognizing higher-order text constituents and their coherent configuration on the global level of textual macro organization.

1. THE FULL-TEXT PROBLEM IN INFORMATION RETRIEVAL

The rapid advance of electronic text production and distribution technology has created enthusiasm with regard to the availability, ease of access, and dissemination of information contained in large electronic text files [1].
Unlike past generations of bibliographic information retrieval systems dealing exclusively with document surrogates such as abstracts, title headings, or keywords, today’s document databases allow the immediate manipulation of source texts, i.e., full-texts such as technical reports, letters, memos, and magazine articles. Unfortunately, conventional information retrieval procedures applying string-based pattern-matching, boolean/adjacency retrieval operators, or thesaurus/classification based domain models face a dramatic decrease of retrieval performance in full-text environments [2]. Even if these tools were adequate, the rich information potential inherent in full-text databases seems wasted when accessed for the purpose of bibliographic reference retrieval only. Considering more advanced applications, we anticipate that future information retrieval systems will be sophisticated full-text processing machines with extensions to other media (e.g., linking document fragments, document versions, or critical annotations to documents in hypertext/hypermedia environments [3]). Their natural language processing devices will therefore have to account for the logical organization of full-texts on a conceptual (knowledge) level intended to structure the content portions they contain.

The presumed potential inherent in these full-text databases has already been recognized, and visionary (text) knowledge workbenches have been described that anticipate rather elaborate devices for in-depth document analysis - full-blown electronic encyclopedias [4] in terms of encyclopedic expert systems [5], sophisticated concept elaboration tools that provide explanatory links among full-text sources and encyclopedic resources [6], or advanced question-answering facilities on top of inferential text knowledge bases [7,8].

Natural language processing methodology for information retrieval applications currently offers three main approaches to the automatic content analysis of large expository full-texts. Statistical models involve frequency counts of document terms for indexing, document clustering based on term association factors for classification, and provide probabilistic relevance measures or evidential reasoning calculi for text retrieval [9-12]. Any one of these approaches provides keyword-level analyses of texts on which reference and/or