Natural Language Engineering 8 (2/3): 235–255. © 2002 Cambridge University Press. DOI: 10.1017/S1351324902002905. Printed in the United Kingdom

Robust discourse parsing via discourse markers, topicality and position

FRANK SCHILDER
Department for Informatics, University of Hamburg, Vogt-Kölln-Str. 30, 22527 Hamburg, Germany

(Received 15 July 2001; revised 11 February 2002)

Abstract

This paper describes a simple discourse parsing and analysis algorithm that combines a formal underspecification utilising discourse grammar with Information Retrieval (IR) techniques. First, linguistic knowledge based on discourse markers is used to constrain a totally underspecified discourse representation. Then, the remaining underspecification is further specified by the computation of a topicality score for every discourse unit. This computation is done via the vector space model. Finally, the sentences in a prominent position (e.g. the first sentence of a paragraph) are given an adjusted topicality score. The proposed algorithm was evaluated by applying it to a text summarisation task. Results from a psycholinguistic experiment, indicating the most salient sentences for a given text as the ‘gold standard’, show that the algorithm performs better than commonly used machine learning and statistical approaches to summarisation.

1 Introduction

The output of a discourse parser is a discourse tree that reflects the rhetorical structure of the input text. Obtaining robustness for a discourse parser is a demanding task due to the many unresolved theoretical issues regarding the derivation of the discourse structure. Although formal model-theoretic approaches such as Discourse Representation Theory (DRT) (Kamp and Reyle 1993) or its extension, Segmented DRT by Asher (1993), can provide a detailed analysis of the content of larger texts, this can only be done when world knowledge is specified a priori in great detail.
However, a world knowledge representation system encompassing the knowledge needed to understand, for instance, a newspaper article does not yet exist. As a consequence, other means have to be found for a robust rhetorical parser.

The outcome of a discourse parser is a hierarchical discourse structure representing the rhetorical information of the text. The discourse structure is not only of theoretical interest; deriving the discourse tree structure can also improve the performance of Information Extraction (IE) tasks or IR applications such as text summarisation (Sumita et al. 1992; Marcu 1999b). Thus, a robust rhetorical parsing algorithm that accurately determines the discourse structure of a text would be useful for many practical applications.
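
To make the second and third steps named in the abstract more concrete (a topicality score from the vector space model, then an adjustment for prominent position), the following is a minimal sketch and not the algorithm as specified in this paper. Several details are assumptions made for illustration only: the topic is supplied as a plain string (for example a title), term weights are raw term frequencies, the adjustment is a multiplicative factor, and the names `topicality_scores` and `position_boost` and the value 1.5 are hypothetical. The discourse-marker step is not covered.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def topicality_scores(paragraphs, topic, position_boost=1.5):
    """Score every sentence against a topic string.

    paragraphs: list of paragraphs, each a list of sentence strings.
    topic: text standing in for the topic (assumption: e.g. the title).
    position_boost: hypothetical factor applied to the first sentence
        of each paragraph; the value is illustrative, not from the paper.
    """
    topic_vec = Counter(tokenize(topic))
    scores = []
    for paragraph in paragraphs:
        for index, sentence in enumerate(paragraph):
            score = cosine(Counter(tokenize(sentence)), topic_vec)
            if index == 0:  # prominent position: paragraph-initial sentence
                score *= position_boost
            scores.append((sentence, score))
    return scores


paragraphs = [
    ["Discourse parsing builds a tree.", "The weather was nice."],
    ["Markers constrain discourse structure."],
]
scores = topicality_scores(paragraphs, "discourse parsing")
```

Here `scores` pairs each sentence with its adjusted score; sentences sharing no terms with the topic string receive a score of zero regardless of position.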