Turkish Discourse Bank: Ongoing Developments

Işın Demirşahin*, Ayışığı Sevdik-Çallı*, Hale Ögel Balaban†, Ruket Çakıcı*, and Deniz Zeyrek*
*Middle East Technical University, Ankara, Turkey
†İstanbul Bilgi University, İstanbul, Turkey
demirsahin@ii.metu.edu.tr, ayisigi@ii.metu.edu.tr, hogel@bilgi.edu.tr, ruken@ceng.metu.edu.tr, dezeyrek@metu.edu.tr

Abstract

This paper describes the first release of the Turkish Discourse Bank (the TDB), the first large-scale, publicly available language resource with discourse-level annotations for Turkish. We describe the features of the source corpus and the sub-corpus annotated for discourse connectives. We provide information about the annotations and other contents of the first release of the TDB. Finally, we describe the ongoing developments, including annotating the sense and the class of the connectives, and the morphological features of the nominalized arguments of subordinating conjunctives.

Keywords: Turkish, discourse bank, discourse connectives

1. Introduction

The Turkish Discourse Bank (the TDB) is the first large-scale, publicly available language resource with discourse-level annotations for Turkish. Following the style of the Penn Discourse Tree Bank (PDTB) (Prasad et al., 2008), the annotations include discourse connectives, modifiers and arguments of connectives, and supplementary materials for the arguments. A sample annotation is given in (1). The connective is underlined; the first argument is in italics and the second argument in bold face.

(1) İnsanlar tabiattan eşit doğarlar. Dolayısıyla özgür ve köle ayrılığı olmamalıdır.
    People are born equal by nature. As a result, there should be no such distinction as the freeman and the slave.

The annotations were carried out using a tool designed specifically for the TDB (Aktaş et al., 2010). They were performed either by three independent annotators, or by a pair of annotators and an independent individual annotator (Zeyrek et al., 2010; Demirşahin et al., ms).

2. Contents of the First Release

The TDB can be requested from www.tdb.ii.metu.edu.tr. The first release of the TDB includes the raw text files, annotation files, annotation guidelines, and a browser.

2.1. Text Files

The TDB is built on a ~400,000-word sub-corpus of the METU Turkish Corpus (the MTC) (Say et al., 2002). The MTC is a 2 million-word resource of post-1990 written Turkish from multiple genres. A total of 159 files (83 columns and 76 essays) were excluded from the TDB, because these genres lack the conventional paragraph structure and make extensive use of boldface. These formatting characteristics were not carried over to the MTC, and their absence might have interfered with the reliable interpretation of the discourse relations and the specification of the extent of the arguments. For the rest of the genres, the TDB preserves the genre distribution of the MTC, as shown in Table 1.

                       the MTC           the TDB
Genre                #        %        #        %
Novel              123    15.63       31    15.74
Story              114    14.49       28    14.21
Research/Survey     49     6.23       13     6.60
Article             38     4.83        9     4.57
Travel              19     2.41        5     2.54
Interview            7     0.89        2     1.02
Memoir              18     2.29        4     2.03
News               419    53.24      105    53.30
Total              787   100         197   100

Table 1: Distribution of the genres in the MTC and the TDB

2.2. Annotations

The annotations are kept in standoff XML files. For each annotated text span, the annotation file records the content text and the offsets for the beginning and the end of the span. All tags except NOTE denote text spans. A sample XML tree for the connective span of (1) is provided in (2).

(2) <Conn>
      <Span>
        <Text>dolayisiyla</Text>
        <BeginOffset>15624</BeginOffset>
        <EndOffset>15635</EndOffset>
      </Span>
    </Conn>
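As an illustration of how such a standoff annotation could be read, the following is a minimal Python sketch, not part of the TDB distribution. It parses the sample in (2) with the standard library and collects the text and offsets of each span. The function names read_spans and span_matches are hypothetical. The sketch also assumes that offsets count characters in the raw text file and that the end offset is exclusive; this is consistent with the sample, where 15635 - 15624 = 11 equals the length of the connective text, but the annotation guidelines should be consulted to confirm it.

```python
import xml.etree.ElementTree as ET

# The connective annotation from example (2), reproduced as a string.
SAMPLE = """<Conn>
  <Span>
    <Text>dolayisiyla</Text>
    <BeginOffset>15624</BeginOffset>
    <EndOffset>15635</EndOffset>
  </Span>
</Conn>"""


def read_spans(element):
    """Return (text, begin, end) for every <Span> under the given element."""
    spans = []
    for span in element.iter("Span"):
        text = span.findtext("Text")
        begin = int(span.findtext("BeginOffset"))
        end = int(span.findtext("EndOffset"))
        spans.append((text, begin, end))
    return spans


def span_matches(raw_text, span):
    """Check a span against the raw text.

    Assumption: offsets are character offsets and the end is exclusive.
    """
    text, begin, end = span
    return raw_text[begin:end] == text


conn = ET.fromstring(SAMPLE)
spans = read_spans(conn)
for text, begin, end in spans:
    print(conn.tag, text, begin, end)
```

Given the contents of the corresponding raw text file as a string, span_matches can serve as a sanity check that the stored text and the offsets agree before any further processing.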