A Study on Bilingually Informed Coreference Resolution

Michal Novák
Charles University, Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics
Malostranské náměstí 25, CZ-11800 Prague 1
mnovak@ufal.mff.cuni.cz

Abstract: Coreference is a basic means of retaining the coherence of a text, and it likely exists in every language. However, languages may differ in how a coreference relation is manifested on the surface. One way to measure the extent and nature of such differences is to build a coreference resolution system that operates on a parallel corpus and extracts information from both language sides of the corpus. In this work, we build such a bilingually informed coreference resolution system and apply it to Czech-English data. We compare its performance with that of a system that learns from a single language only. Our results show that the cross-lingual approach outperforms the monolingual one. They also suggest that a system for Czech can exploit the additional English information more effectively than the other way round. The work concludes with a detailed analysis that tries to reveal the reasons behind these results.

1 Introduction

Cross-lingual techniques are becoming increasingly popular. Although they have also reached the task of Coreference Resolution (CR), research there is mostly limited to cross-lingual projection. Other cross-lingual techniques remain largely unexplored for this task.

One cross-lingual technique that has so far been neglected is bilingually informed resolution. In this approach, decisions in a particular task are based on information from bilingual parallel data. Parallel texts must be available not only when a method is trained but also at test time, that is, when a trained model is applied to new data.
In real-world scenarios, the need for parallel data at test time means that a machine translation service must be applied to acquire them (MT-based bilingually informed resolution). Nevertheless, for limited purposes it may pay off to use human-translated parallel data instead (corpus-based bilingually informed resolution). If it outperforms the monolingual approach, it may be used in building automatically annotated parallel corpora. Such corpora with more reliable annotation could be useful for corpus-driven theoretical research (as long as the cross-lingual origin of the annotation does not matter). It can also be used for automatic processing. For instance, improved resolution on big parallel data might be leveraged in a weakly supervised manner to boost models trained in a monolingual way.

The present work is concerned with corpus-based bilingually informed CR on Czech-English texts. Specifically, it focuses on the resolution of pronouns and zeros, as these are the coreferential expressions whose grammatical and functional properties differ considerably across the two languages. For instance, whereas in English most non-living objects are referred to with pronouns in the neuter gender (e.g. “it”, “its”), genders are distributed more evenly in Czech. Information on Czech genders may thus be useful for filtering out English candidates that are highly unlikely to be coreferential with the pronoun. By comparing the performance of the bilingually informed system with that of a monolingual approach, and by thoroughly analyzing the results, our work aims to discover the extent and nature of such differences.

The paper is structured as follows. After mentioning related work (Section 2), we introduce a coreference resolver (Section 3) in both its monolingual and cross-lingual variants. Section 4 describes the dataset used in the experiments of Section 5. Before we conclude, the results of the experiments are thoroughly analyzed (Section 6).
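To make the gender example from the introduction concrete, the following is a minimal Python sketch of how the gender of an aligned Czech counterpart could prune English antecedent candidates. It is not the resolver introduced in Section 3: the `Mention` class, the `cs_gender` field, the gender labels and the hard filtering rule are all assumptions made for illustration, and a real system would more plausibly use such information as one feature among many rather than as a strict filter.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Mention:
    """An English mention plus the gender of its aligned Czech counterpart.

    `cs_gender` is a hypothetical label; None means no aligned Czech word.
    """
    text: str
    cs_gender: Optional[str] = None


def filter_by_czech_gender(pronoun: Mention,
                           candidates: List[Mention]) -> List[Mention]:
    """Keep candidates whose Czech gender does not contradict the pronoun's.

    Candidates without an aligned Czech counterpart are kept, because
    there is no evidence against them.
    """
    if pronoun.cs_gender is None:
        # Nothing to compare against, so no candidate can be ruled out.
        return list(candidates)
    return [c for c in candidates
            if c.cs_gender in (None, pronoun.cs_gender)]


# English "it" says little about gender, but its Czech counterpart
# (e.g. the feminine pronoun "ji") does.
pronoun = Mention("it", cs_gender="fem")
candidates = [
    Mention("the plan", cs_gender="masc_inan"),  # Czech "plán"
    Mention("the company", cs_gender="fem"),     # Czech "společnost"
    Mention("the law", cs_gender="masc_inan"),   # Czech "zákon"
]
kept = filter_by_czech_gender(pronoun, candidates)
print([m.text for m in kept])  # ['the company']
```

On the English side alone, all three candidates are compatible with “it”; only the Czech genders separate them in this toy case.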
S. Krajči (ed.): ITAT 2018 Proceedings, pp. 130–137, CEUR Workshop Proceedings Vol. 2203, ISSN 1613-0073, © 2018 Michal Novák

2 Related Work

Building a bilingually informed CR system requires a parallel corpus in which at least the target-language side is annotated with coreference. Even today, very few such corpora exist, e.g. Prague Czech-English Dependency Treebank 2.0 Coref [14], ParCor 1.0 [9] and parts of OntoNotes 5.0 [19]. It is thus surprising that the popularity of this approach peaked around ten years before these corpora were published.

Harabagiu and Maiorano [10] present a heuristics-based approach to CR. The set of heuristics is expanded by exploiting the transitivity property of coreferential chains in a bootstrapping fashion. They expand the heuristics further by following the counterparts of mentions in coreference-annotated translations of the source English texts into Romanian.

Mitkov and Barbu [13] adjust a rule-based pronoun coreference resolution system to work on a parallel corpus. After providing a linguistic comparison of English and French pronouns and their behavior in discourse, the authors distill their findings into a set of cross-lingual rules to be integrated into the CR system. In evaluation, they observe im-