Towards Automatic Retrieval of Idioms in French Newspaper Corpora

Liesbeth Degand and Yves Bestgen
Université catholique de Louvain, Louvain-la-Neuve, Belgium

Literary and Linguistic Computing, Vol. 18, No. 3 © ALLC 2003; all rights reserved, p. 249

Correspondence: Liesbeth Degand, Université catholique de Louvain, Place B. Pascal, 1, B-1348 Louvain-la-Neuve, Belgium. E-mail: degand@lige.ucl.ac.be

Abstract

The goal of this paper is to present a procedure for the automatic retrieval of idiomatic expressions from large text corpora. The procedure combines text segmentation techniques and Latent Semantic Analysis. Three indices were computed on the basis of the three-fold hypothesis that: (1) idiomatic expressions should have few neighbours; (2) idiomatic expressions should demonstrate low semantic proximity between the words composing them; (3) idiomatic expressions should demonstrate low semantic proximity between the expression and the preceding and subsequent segments. The result of this procedure shows that we have not yet reached a fully automatic retrieval of idioms from large corpora, but this first trial has shown that we are on the way. The procedure reduces the amount of data to consider to less than a quarter (23.8 per cent) of the original data, of which one-fifth (20.9 per cent) is idiomatic, and nearly 60 per cent (58.8 per cent) is phraseological in nature. In other words, this procedure drastically improves and facilitates hand-based retrieval. In addition, these first results already permit some linguistic exploitation of the retrieved idioms.

1 Introduction

Although figurative language is very frequent overall in both written and spoken language (Sinclair, 1991; Gibbs, 1994), (corpus-based) research on idioms faces a number of obstacles. There is no doubt about the importance of idioms as a whole in texts (especially in written journalism; Moon, 1998a), but if one is looking for a particular verbal idiom (e.g. bite the dust, spill the beans) in a given (multi-million word) corpus, its relative frequency will seldom exceed one per million words (Moon, 1998b). Also, Nicolas (1995, p. 233) found that ‘contrary to received views, at least 90 per cent of V-NP idioms (…) appear to allow some form of (syntactically) internal modification’ (e.g. be grist for the mill becoming be grist for the linguistic mill). In addition, idiomatization appears to be one of the principal factors in the evolution of language (Chafe, 1970). In