Lexical idiosyncrasy in MWE extraction

Csaba Oravecz, Viktor Nagy and Károly Varasdi
Research Institute for Linguistics
Hungarian Academy of Sciences
{oravecz,nagyv,varasdi}@nytud.hu

1 Introduction

A wide range of NLP methods has been investigated for the extraction of multiword expressions (MWEs) from large corpora. While a good deal of recent research has focused on developing reliable means to delineate different subclasses of MWEs with respect to their degree of compositionality (Baldwin et al., 2003; McCarthy et al., 2003), it has been generally accepted that for the "simple" task of separating MWEs from fully productive word combinations, the substitutability of the component words of a multiword unit with semantic neighbours can serve as a good indicator (Bannard et al., 2003). The underlying assumption is that MWEs do not generally tolerate the replacement of their components with semantically similar items. (Let us call this phenomenon lexical idiosyncrasy.) If this substitutability could be captured by some ranking measure, we would have reliable information on whether a word combination should be considered a multiword unit or not (Pearce, 2001).

In this paper we investigate the usability of lexical idiosyncrasy in MWE detection/extraction in Hungarian. We try to demonstrate that, contrary to intuition, while the above hypothesis about the lexical idiosyncrasy of MWEs might well be true, it can be problematic to detect this idiosyncrasy in large corpora with the reliability necessary for an efficient extraction method: methods based on measuring differences in substitutability might not perform as well as the "good old" ones based on association measures such as some variant of MI or t-score.

The remainder of the paper is structured as follows.
In section 2 we give a brief description of the idiosyncratic semantic behaviour of MWEs, which can be utilised (at least in theory) to demarcate MWEs from productive word combinations or to identify different MWE classes. Section 3 discusses the extraction methods we experimented with, while section 4 presents the experiments carried out under several scenarios and evaluates how the different techniques perform. We do not devote a separate section to related work or past research; rather, we refer throughout the paper to the resources and methods we slavishly adopt. Conclusions and suggestions for further work end the paper in section 5.
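
For orientation, the "good old" association measures mentioned above can be made concrete with a small sketch. It uses the common textbook definitions of pointwise mutual information and the t-score computed from raw corpus counts. The text only refers to "some variant" of these measures, so the exact formulas, the function names and the counts below are assumptions made for illustration, not the authors' implementation.

```python
import math


def pointwise_mi(f_xy, f_x, f_y, n):
    """Pointwise mutual information (base 2) of a word pair.

    f_xy: co-occurrence count of the pair, f_x and f_y: counts of the
    individual words, n: corpus size. All names are illustrative.
    """
    # Ratio of the observed joint probability to the product of the marginals.
    return math.log2((f_xy * n) / (f_x * f_y))


def t_score(f_xy, f_x, f_y, n):
    """t-score: observed minus expected count, scaled by sqrt(observed)."""
    expected = f_x * f_y / n  # expected count if the words were independent
    return (f_xy - expected) / math.sqrt(f_xy)


# Hypothetical counts, invented purely for illustration.
corpus_size = 1_000_000
print(pointwise_mi(f_xy=50, f_x=200, f_y=500, n=corpus_size))
print(t_score(f_xy=50, f_x=200, f_y=500, n=corpus_size))
```

Both scores are zero when the pair co-occurs exactly as often as independence would predict and grow as the pair becomes more strongly associated. Candidate word combinations can then be ranked by either score.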