Term conflation methods in information retrieval
Non-linguistic and linguistic approaches

Carmen Galvez, Félix de Moya-Anegón and Víctor H. Solana
Department of Information Science, University of Granada, Granada, Spain

Abstract

Purpose – To propose a categorization of the different conflation procedures into two basic approaches, non-linguistic and linguistic techniques, and to justify the application of normalization methods within the framework of linguistic techniques.

Design/methodology/approach – Presents a range of term conflation methods that can be used in information retrieval. The uniterm and multiterm variants can be considered equivalent units for the purposes of automatic indexing. Stemming algorithms, segmentation rules, association measures and clustering techniques are well-evaluated non-linguistic methods, and experiments with these techniques show a wide variety of results. Alternatively, lemmatisation and the use of syntactic pattern-matching, through equivalence relations represented in finite-state transducers (FST), are emerging methods for the recognition and standardization of terms.

Findings – The survey attempts to point out the positive and negative effects of the linguistic approach and its potential as a term conflation method.

Originality/value – Outlines the importance of FSTs for the normalization of term variants.

Keywords Information retrieval, Document management, Indexing, Variance reduction

Paper type Conceptual paper

Introduction

In many information retrieval systems (IRS), the documents are indexed by uniterms. However, uniterms may be ambiguous, and therefore unable to discriminate only the pertinent information. One solution to this problem is to work with multiterms (multi-word terms or phrases), often obtained through statistical methods.
The traditional IRS approach is based on this type of automatic indexing technique for representing documentary contents (Salton, 1980, 1989; Croft et al., 1991; Frakes and Baeza-Yates, 1992). The concepts behind such terms can be manifested in different forms, known as linguistic variants. A variant is defined as a text occurrence that is conceptually related to an original term. In order to avoid the loss of relevant documents, an IRS recognizes and groups variants by means of so-called conflation methods, or term normalization methods. The process of conflation may involve linguistic techniques such as the segmentation of words and the elimination of affixes, or lexical searches through thesauri. The latter is concerned with the recognition of semantic variants. The grouping of morphological variants would increase average recall, while the identification and grouping of syntactic variants is decisive in increasing the accuracy of retrieval. A study of the problems involved in using linguistic variants in IRS can be found in Sparck Jones and Tait (1984).

The Emerald Research Register for this journal is available at www.emeraldinsight.com/researchregister. The current issue and full text archive of this journal is available at www.emeraldinsight.com/0022-0418.htm.

Received March 2004. Revised September 2004. Accepted January 2005.

Journal of Documentation, Vol. 61 No. 4, 2005, pp. 520-547. © Emerald Group Publishing Limited, 0022-0418. DOI 10.1108/00220410510607507
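The grouping of morphological variants by affix elimination, introduced above, can be illustrated with a minimal sketch. It assumes a hypothetical, deliberately small suffix list and a minimum stem length of three characters; neither comes from this paper, and real stemming algorithms apply far more elaborate rules.

```python
from collections import defaultdict

# Hypothetical toy suffix list (an assumption for illustration only),
# ordered so that longer suffixes are tried before shorter ones.
SUFFIXES = ("ations", "ation", "ings", "ing", "ers", "er", "ed", "es", "s")


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word  # no suffix applied: the word is its own stem


def conflate(terms):
    """Group terms that share the same toy stem into one equivalence class."""
    groups = defaultdict(set)
    for term in terms:
        groups[toy_stem(term)].add(term)
    return dict(groups)


if __name__ == "__main__":
    variants = ["index", "indexing", "indexed", "indexes", "indexer"]
    for stem, members in conflate(variants).items():
        print(stem, sorted(members))
```

With this suffix list, the five variants of "index" fall into a single class, which is the sense in which conflation is expected to raise recall. The same sketch keeps "retrieve" and "retrieved" apart, because only "retrieved" loses a suffix; errors of this kind are one reason results with such techniques vary.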