A Corpus Study of Referential Choice: The Role of Rhetorical Structure

Andrej A. Kibrik (Institute of Linguistics, Russian Academy of Sciences)
Olga N. Krasavina (Moscow State University and Humboldt University of Berlin)
kibrik@comtv.ru; krasavio@rz.hu-berlin.de

Abstract

This study shares the view that reference in discourse is influenced by the distance to prior mentions of the referent. Kibrik (1996, 1999) suggested a measurement of rhetorical distance to assess this factor. In this paper we address three complications that arise when that methodology is applied to a large corpus of written newspaper texts: the difference between symmetrical and asymmetrical discourse structures as sites of antecedents; the type of rhetorical relation as a factor of referential choice; and multiple (competing) antecedents. We outline a model that is relevant both for the theory of referential choice in discourse and for applied work on anaphora resolution or generation.

1. Introduction: “Will rhetorical structure redeem us?”

Pronominal anaphora has long been a favorite object of study in diverse theoretical frameworks. The last decade has seen a growing number of studies of anaphora, especially of anaphora resolution. At present, however, both theoretical and applied approaches face a kind of stagnation. Anaphora theorists still lack a model, validated on large natural language corpora, that can explain or predict the use of basic anaphoric devices, and computational linguists are not achieving significant improvements in their resolution algorithms. The hierarchical, or rhetorical, structure of discourse is a potentially important but still insufficiently studied factor that affects the use of anaphoric devices.
There is some evidence for this (Fox 1987, Grosz and Sidner 1986, etc.); the quote in the title is but one of the cries from the heart in the anaphora research community (see Wolters 2001). Unfortunately, the existing heuristics for the use of anaphoric devices are often too rough (cf. Cristea et al. 1998, 2000). The present study attempts to address this problem by investigating the following discourse-structural features:

• distance to the antecedent
• semantic type of the rhetorical relation between clauses
• choice between two or more potential antecedents.

In this study we approach referential phenomena from the production perspective. That is, we are interested in referential choice by the speaker rather than in reference resolution. The study builds on a corpus of American English newspaper texts, the RST Discourse Treebank (Carlson et al. 2003), which is annotated for rhetorical structure following the principles of Rhetorical Structure Theory (see Mann and Thompson 1988).

Section 2 gives a brief description of Rhetorical Structure Theory. Section 3 discusses the role of rhetorical structure in referential choice. To evaluate the role of discourse structure we employ the parameter of rhetorical distance proposed in Kibrik (1996, 1999) (Section 4). A number of improvements to the prior procedure are discussed in Sections 5 to 7. Section 8 concludes the paper.

2. Rhetorical Structure Theory

Rhetorical Structure Theory, or RST (Mann and Thompson 1988), is one of the most widely used tools for assessing discourse coherence, bringing together the global and local structure of discourse. According to RST, discourse is divided into discourse units. Elementary discourse units essentially coincide with clauses. RST assumes a number of rhetorical relations between discourse units, which can be either symmetrical (multinuclear) or asymmetrical (mononuclear).
An asymmetrical relation connects a nucleus and a satellite, and a symmetrical relation connects two or more nuclei. Rhetorical relations resemble semantic relations between the main and adjunct clauses in complex sentences, but extend to the discourse level, that is, can connect discourse units irrespective