The Use of Referential Constraints in Structuring Discourse

Violeta Sereţan♦, Dan Cristea♣

♦ University of Geneva, Language Technology Laboratory
2, rue de Candolle, CH-1211 Genève 4, Switzerland
Violeta.Seretan@lettres.unige.ch

♣ "Al.I.Cuza" University of Iaşi, Dept. of Computer Science
16, Berthelot St., 6600 – Iaşi, Romania
dcristea@infoiasi.ro

Abstract

The quality of discourse structure annotations suffers from the numerous difficulties that occur in the analysis process. In contrast, referential annotation resources are considerably more reliable, given the high precision of existing anaphora resolution systems. We present an approach based on Veins Theory (Cristea, Ide, Romary, 1998), in which successful reference annotations of texts are exploited in order to improve arbitrary structural analyses; in this way, the large amount of corpora annotated at the reference level can be used for the acquisition of discourse structure annotation resources.

1. Introduction

Discourse analysis is an important but difficult task, requiring considerable effort to capture the relations between text constituents and thus to decode the author's communicative intentions. The popular theory of discourse structure, Rhetorical Structure Theory (RST; Mann and Thompson, 1988), has been shown to account for the manner in which a text is organized in correspondence with the speaker's intention. Recent research on the relationship between text structure and intentions (Moser and Moore, 1996; Marcu, 1999) has shown the similarity between the intention-based discourse structure (Grosz and Sidner, 1986) and the RST tree-like structure built for a text. Despite its popularity and usability, RST lacks important prescriptions on the criteria for the hierarchical aggregation of pieces of text, which makes structural analysis an ambiguous task and leads to inappropriate text interpretations.

Many elements provide hints on the discourse structure, e.g.
delimiters, cue-phrases, time, etc., which are extensively used for the automatic computation of coherence relations (Marcu, 1997; Kurohashi and Nagao, 1997). We will focus on those elements that better indicate the structure of a text: the co-references in it. A reference from an anaphor to its antecedent indicates a structural relation between the textual units involved. The referential chains in discourse (the repeated references to the same discourse entity) contain important information about the text organization; therefore, they should also be considered when structuring discourse.

While it is widely agreed that there is a direct relation between the references in a text and its structure (Fox, 1987; Vonk et al., 1992), up to now most of the attention has been concentrated in only one direction, i.e. on the way in which the process of anaphora resolution is influenced by the hierarchical organization of the text. For example, Fox (1987) indicates that the treatment of anaphora should consider the hierarchical structure of texts, and Cristea et al. (2000) show that anaphora resolution can benefit from following a discourse organization. We concentrate, instead, on the way the use of referring expressions restricts the discourse interpretation, and we intend to use them as disambiguation clues during structure derivation, or as structural constraints for correcting arbitrary (possibly automatically generated) analyses.

A reliable reference annotation can be used to impose constraints on the partial or complete discourse structure built for a text, by means of the prescriptions on the relationship between discourse structure and references stated in Veins Theory (VT; Cristea, Ide, Romary, 1998). According to these prescriptions, the reference chains in a text are associated with sets of structurally related units, the "veins" of discourse. The references from a given unit are mostly to preceding units that are contained in the unit's vein.
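As a rough illustration of this constraint, the sketch below checks a set of reference links against veins that are assumed to be already available. It is not the method of the paper: the function name `links_outside_veins`, the input format (units numbered in text order, one set of units per vein) and the toy data are all hypothetical, and the computation of the veins themselves is left out.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical input format: units are numbered in text order, and veins[u]
# is the set of units on the vein of unit u, taken as given.
Veins = Dict[int, Set[int]]
Link = Tuple[int, int]  # (unit of the anaphor, unit of the antecedent)


def links_outside_veins(veins: Veins, links: List[Link]) -> List[Link]:
    """Return the links whose antecedent is not a preceding unit on the anaphor's vein."""
    violations = []
    for anaphor, antecedent in links:
        if antecedent == anaphor:
            # Reference resolved inside the same unit: nothing to check.
            continue
        # Only units that precede the anaphor's unit on its vein are accessible.
        accessible = {u for u in veins.get(anaphor, set()) if u < anaphor}
        if antecedent not in accessible:
            violations.append((anaphor, antecedent))
    return violations


# Toy data, invented for illustration only.
veins = {1: {1}, 2: {1, 2}, 3: {1, 3}, 4: {1, 3, 4}}
links = [(2, 1), (3, 1), (4, 2), (4, 3)]
flagged = links_outside_veins(veins, links)
print(flagged)
```

In this toy input, only the link from unit 4 to unit 2 is flagged, since unit 2 is not on the assumed vein of unit 4. Because references are said to be "mostly" to preceding units on the vein, a flagged link is best read as a hint that the structure deserves a second look, not as proof of an error.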
In a well-formed structure, anaphors are expected to be resolved along the veins. If this is not the case, it is likely that errors have occurred in the process of interpretation, which disrupted the structural connection, i.e. the path given by the vein, between the pairs of units concerned.

The goal of our approach is to systematically detect and correct the structural errors that are likely to occur during structural annotation, and that can be signalled by the resolution of anaphors outside their domains of accessibility as given by the structure.

The next section briefly reviews the main concepts and ideas of Veins Theory on which our approach is based. It is followed by a section that describes how the VT prescriptions are used to better structure discourse, and presents a method for the local and global correction of a structure. Afterwards, we present the results of an empirical study on a corpus of texts, the conclusions it allows us to draw and, finally, a comparison with related work.

2. Global Discourse Cohesion in VT

Veins Theory extends and formalizes the relation between discourse structure and reference proposed by Fox (1987). Its central notion is the "vein", defined over discourse structure trees built according to the RST requirements.