Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017), pages 127–132, Valencia, Spain, April 4. ©2017 Association for Computational Linguistics

The ATILF-LLF System for Parseme Shared Task: a Transition-based Verbal Multiword Expression Tagger

Hazem Al Saied
Université de Lorraine, ATILF, CNRS
Nancy, France
halsaied@atilf.fr

Marie Candito
Université Paris Diderot, LLF
Paris, France
marie.candito@linguist.univ-paris-diderot.fr

Matthieu Constant
Université de Lorraine, ATILF, CNRS
Nancy, France
Mathieu.Constant@univ-lorraine.fr

Abstract

We describe the ATILF-LLF system built for the MWE 2017 Shared Task on automatic identification of verbal multiword expressions. We participated in the closed track only, for all the 18 available languages. Our system is a robust greedy transition-based system, in which MWEs are identified through a MERGE transition. The system was meant to accommodate the variety of linguistic resources provided for each language, in terms of accompanying morphological and syntactic information. Using per-MWE F-score, the system was ranked first¹ for all but two languages (Hungarian and Romanian).

1 Introduction

Verbal multi-word expressions (hereafter VMWEs) tend to exhibit more morphological and syntactic variation than other MWEs, if only because in general the verb is inflected, and it can receive adverbial modifiers. Furthermore, some VMWEs, in particular light verb constructions (one of the VMWE categories provided in the shared task), allow for the full range of syntactic variation (extraction, coordination, etc.). This renders the VMWE identification task even more challenging than general MWE identification, in which fully frozen and contiguous expressions help increase the overall performance.

The data sets are quite heterogeneous, both in terms of the number of annotated VMWEs and of accompanying resources (for the closed track).² So our first priority when setting up the architecture was to build a generic system applicable to all the 18 languages, with limited language-specific tuning. We thus chose to participate in the closed track only, relying exclusively on the training data, the accompanying CoNLL-U files when available, and basic feature engineering. We developed a one-pass greedy transition-based system, which we believe can handle discontinuities elegantly. We integrated more or less informed feature templates, depending on which annotations are available in the data.

We describe our system in section 2, the experimental setup in section 3, the results in section 4 and the related work in section 5. We conclude in section 6 and give perspectives for future work.

2 System description

The identification system we used is a simplified and partial implementation of the system proposed in Constant and Nivre (2016), which is itself a mild extension of an arc-standard dependency parser (Nivre, 2004). Constant and Nivre (2016) proposed a parsing algorithm that jointly predicts a syntactic dependency tree and a forest of lexical units including MWEs. In particular, in line with Nivre (2014), this system integrates special parsing mechanisms to deal with lexical analysis. Given that the shared task focuses on the lexical task only and that the datasets do not always provide syntactic annotations, we have modified the structure of the original system by removing syntax prediction, in order to use the same system for all 18 languages.

A transition-based system consists in applying a sequence of actions (namely transitions) to incrementally build the expected output structure in a bottom-up manner. Each transition is

¹ 2 systems participated for one language only (French), and 5 systems participated for more than one language.
² Some of the data sets contain the tokenized sentences plus VMWEs only (BG, ES, HE, LT), some are accompanied by morphological information such as lemmas and POS (CS, MT, RO, SL), and for the third group (the 10 remaining languages) full dependency parses are provided. See (Savary et al., 2017) for more information on the data sets.
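To make the idea of a greedy transition-based tagger concrete, the following is a minimal Python sketch, not the system itself. Only the MERGE transition is named in the text above; the SHIFT and COMPLETE transitions, the `Config` class, the `parse` function and the lexicon-driven `toy_policy` are assumptions introduced for illustration. In particular, `toy_policy` merely stands in for whatever trained model would choose the next transition from feature templates.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Unit = Tuple[int, ...]  # token indices covered by one lexical unit


@dataclass
class Config:
    """Parser state: a stack of partial units, the unread tokens, finished units."""
    stack: List[Unit] = field(default_factory=list)
    buffer: List[int] = field(default_factory=list)
    done: List[Unit] = field(default_factory=list)


def shift(c: Config) -> None:
    # Hypothetical SHIFT: move the next token from the buffer onto the stack.
    c.stack.append((c.buffer.pop(0),))


def merge(c: Config) -> None:
    # MERGE: combine the two topmost stack elements into a single unit.
    right = c.stack.pop()
    left = c.stack.pop()
    c.stack.append(left + right)


def complete(c: Config) -> None:
    # Hypothetical COMPLETE: pop the top unit and record it as finished.
    c.done.append(c.stack.pop())


TRANSITIONS = {"SHIFT": shift, "MERGE": merge, "COMPLETE": complete}


def toy_policy(c: Config, tokens: List[str],
               lexicon=frozenset({("took", "decision")})) -> str:
    """Stand-in for a trained classifier: picks one transition per step."""
    def words(unit: Unit) -> Tuple[str, ...]:
        return tuple(tokens[i] for i in unit)

    if len(c.stack) >= 2 and words(c.stack[-2] + c.stack[-1]) in lexicon:
        return "MERGE"
    if c.stack:
        top = words(c.stack[-1])
        starts_entry = any(e[:len(top)] == top and len(e) > len(top)
                           for e in lexicon)
        # Keep a possible VMWE head on the stack while tokens remain.
        if starts_entry and c.buffer and len(c.stack) == 1:
            return "SHIFT"
        return "COMPLETE"
    return "SHIFT"


def parse(tokens: List[str],
          policy: Callable[[Config, List[str]], str]) -> List[Unit]:
    """One greedy left-to-right pass; returns the multi-token units found."""
    c = Config(buffer=list(range(len(tokens))))
    while c.buffer or c.stack:
        TRANSITIONS[policy(c, tokens)](c)
    return sorted(u for u in c.done if len(u) > 1)


print(parse(["He", "took", "a", "quick", "decision"], toy_policy))  # -> [(1, 4)]
```

In this sketch the intervening tokens "a" and "quick" are completed as single-token units while "took" waits on the stack, so the later MERGE yields the discontinuous unit (1, 4). This is one way a stack-based system can accommodate discontinuities, under the assumed transition set.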