Literary and Linguistic Computing, Vol. 18, No. 3 © ALLC 2003; all rights reserved
1 Introduction
Although figurative language is very frequent in both written and
spoken language (Sinclair, 1991; Gibbs, 1994), (corpus-based) research on
idioms faces a number of obstacles. There is no doubt about the importance
of idioms as a whole in texts (especially in written journalism; Moon,
1998a), but if one is looking for a particular verbal idiom (e.g. bite the
dust, spill the beans) in a given multi-million-word corpus, its relative
frequency will seldom exceed one per million words (Moon, 1998b).
Also, Nicolas (1995, p. 233) found that ‘contrary to received views, at
least 90 per cent of V-NP idioms (…) appear to allow some form of
(syntactically) internal modification’ (e.g. be grist for the mill becoming be
grist for the linguistic mill). In addition, idiomatization appears to be one
of the principal factors in the evolution of language (Chafe, 1970). In
Correspondence:
Liesbeth Degand,
Université catholique de Louvain,
Place B. Pascal, 1,
B-1348 Louvain-la-Neuve, Belgium.
E-mail:
degand@lige.ucl.ac.be
Towards Automatic Retrieval of
Idioms in French Newspaper
Corpora
Liesbeth Degand and Yves Bestgen
Université catholique de Louvain, Louvain-la-Neuve, Belgium
Abstract
The goal of this paper is to present a procedure for the automatic retrieval of
idiomatic expressions from large text corpora. The procedure combines text
segmentation techniques and Latent Semantic Analysis. Three indices were
computed on the basis of the three-fold hypothesis that: (1) idiomatic
expressions should have few neighbours; (2) idiomatic expressions should
demonstrate low semantic proximity between the words composing them; (3) idiomatic
expressions should demonstrate low semantic proximity between the expression
and the preceding and subsequent segments. The results show that fully
automatic retrieval of idioms from large corpora has not yet been achieved,
but this first trial is a step in that direction. The procedure
reduces the amount of data to consider to less than a quarter (23.8 per cent) of
the original data, of which one-fifth (20.9 per cent) is idiomatic, and nearly
60 per cent (58.8 per cent) is phraseological in nature. In other words, this
procedure drastically improves and facilitates hand-based retrieval. In addition,
these first results already permit some linguistic exploitation of the retrieved
idioms.
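
To make hypothesis (2) more concrete, the following minimal sketch computes the mean pairwise cosine similarity between the words of an expression. It is an illustration only, not the procedure used in this paper: the three-dimensional vectors in TOY_VECTORS are invented for the example, whereas in practice they would come from a Latent Semantic Analysis space built on a corpus, and the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_pairwise_proximity(words, vectors):
    """Average cosine similarity over all pairs of words in an expression."""
    pairs = list(combinations(words, 2))
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


# Invented toy vectors (assumption): stand-ins for rows of an LSA word space.
TOY_VECTORS = {
    "spill": [0.9, 0.1, 0.0],
    "beans": [0.0, 0.2, 0.9],
    "milk": [0.8, 0.3, 0.1],
}

idiom_score = mean_pairwise_proximity(["spill", "beans"], TOY_VECTORS)
literal_score = mean_pairwise_proximity(["spill", "milk"], TOY_VECTORS)

# Under hypothesis (2), the idiomatic pairing should score lower.
print(idiom_score < literal_score)
```

With these toy vectors the idiomatic pairing scores lower than the literal one, which is the pattern hypothesis (2) predicts; real LSA vectors would of course be noisier.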