ENHANCEMENTS TO THE BOUN TREEBANK REFLECTING THE AGGLUTINATIVE NATURE OF TURKISH

Büşra Marşan¹ busra.marsan@boun.edu.tr
Salih Furkan Akkurt² furkan.akkurt@boun.edu.tr
Muhammet Şen² muhammet.sen@boun.edu.tr
Merve Gürbüz² merve.gurbuz@boun.edu.tr
Onur Güngör² onurgu@boun.edu.tr
Şaziye Betül Özateş² saziye.bilgin@boun.edu.tr
Suzan Üsküdarlı² suzan.uskudarli@boun.edu.tr
Arzucan Özgür² arzucan.ozgur@boun.edu.tr
Tunga Güngör² gungort@boun.edu.tr
Balkız Öztürk¹ balkiz.ozturk@boun.edu.tr

1. Linguistics Department, Boğaziçi University
2. Computer Engineering Department, Boğaziçi University

ABSTRACT

In this study, we offer linguistically motivated solutions to three representational gaps in the BOUN Treebank: null morphemes, highly productive derivational processes, and syncretic morphemes of Turkish, none of which were adequately represented. The solutions do not diverge from the Universal Dependencies framework. To tackle these issues, we introduced new annotation conventions that split certain lemmas and use the MISC (miscellaneous) field of the UD framework to denote derivation. The representational capabilities of the re-annotated treebank were tested on an LSTM-based dependency parser, and an updated version of the BoAT Tool is introduced.

Keywords: Universal Dependencies · Turkish · morphological analysis · dependency annotation · dependency parsing

1 Introduction

Following the dependency grammar framework first proposed by Tesnière [20], dependency trees illustrate how sentence elements relate to one another through head and dependent relations. Universal Dependencies¹ (UD) is an international cooperative treebank project based on the dependency grammar framework. It aims to offer a standardized and comprehensive dependency treebank collection covering 121 languages. With the addition of new UD treebanks, Turkish no longer qualifies as a low-resource language.
With a total of 733,000 tokens, Turkish ranks 12th in size in the UD repository. Although the coverage of treebanks plays an essential role in improving the performance of natural language processing (NLP) systems [8], their ability to represent the morphosyntactic features of the target language correctly and consistently should not be overlooked. As the study by Vincze et al. [22] shows, the better a treebank represents the morphology and syntax of the target language, the better the NLP systems that use that treebank as a resource perform.

In this paper, we abide by the linguistic framework set by Bedir et al. [3] and offer an updated and comprehensive UD treebank for Turkish, the BOUN Treebank, along with an improved UD annotation interface, the BoAT Tool, first introduced in Türk et al. [21]. The decisions made in the re-annotation of the BOUN Treebank address issues posed by the morphologically rich and complex nature of Turkish: null morphemes are frequently employed, agglutinative processes are heavily used to create new forms, and numerous morphemes, such as the copula and -ki, are highly syncretic. The main goal of this study is to represent these phenomena without compromising compliance with the UD framework.

¹ https://universaldependencies.org

arXiv:2207.11782v1 [cs.CL] 24 Jul 2022
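As a concrete illustration of the head and dependent relations mentioned in the introduction, the following minimal sketch reads one sentence in the standard ten-column CoNLL-U format used by UD treebanks and lists the dependents of its root. The sample sentence ("Ali kitabı okudu.", roughly "Ali read the book.") and its annotation are hypothetical and are not taken from the BOUN Treebank. The names `Token`, `parse_sentence`, and `dependents` are illustrative helpers, not part of the BoAT Tool. The MISC column here carries only the standard UD attribute `SpaceAfter=No`; the derivation annotations introduced in this paper are not reproduced.

```python
from dataclasses import dataclass

# Hypothetical sample sentence in CoNLL-U (10 tab-separated columns per token):
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
SAMPLE = "\n".join([
    "# text = Ali kitabı okudu.",
    "1\tAli\tAli\tPROPN\t_\tCase=Nom|Number=Sing|Person=3\t3\tnsubj\t_\t_",
    "2\tkitabı\tkitap\tNOUN\t_\tCase=Acc|Number=Sing|Person=3\t3\tobj\t_\t_",
    "3\tokudu\toku\tVERB\t_\tNumber=Sing|Person=3|Tense=Past\t0\troot\t_\tSpaceAfter=No",
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_",
])


@dataclass
class Token:
    id: int
    form: str
    lemma: str
    upos: str
    feats: dict
    head: int      # ID of the head token; 0 marks the root
    deprel: str
    misc: dict


def parse_kv(field: str) -> dict:
    """Parse a 'Key=Value|Key=Value' column; '_' means the column is empty."""
    if field == "_":
        return {}
    return dict(item.split("=", 1) for item in field.split("|"))


def parse_sentence(block: str) -> list:
    """Parse one CoNLL-U sentence block into a list of Token objects."""
    tokens = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}")
        if not cols[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        tokens.append(Token(
            id=int(cols[0]), form=cols[1], lemma=cols[2], upos=cols[3],
            feats=parse_kv(cols[5]), head=int(cols[6]), deprel=cols[7],
            misc=parse_kv(cols[9]),
        ))
    return tokens


def dependents(tokens: list, head_id: int) -> list:
    """Return every token whose HEAD column points at head_id."""
    return [t for t in tokens if t.head == head_id]


tokens = parse_sentence(SAMPLE)
root = next(t for t in tokens if t.head == 0)
for dep in dependents(tokens, root.id):
    print(f"{root.form} --{dep.deprel}--> {dep.form}")
```

Each token stores only the ID of its head, so the whole tree is recoverable from the HEAD and DEPREL columns alone; the FEATS and MISC columns carry the morphological and miscellaneous information as key-value pairs.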