Term conflation methods in information retrieval: Non-linguistic and linguistic approaches

Carmen Galvez, Félix de Moya-Anegón and Víctor H. Solana
Department of Information Science, University of Granada, Granada, Spain

Abstract
Purpose – To propose a categorization of the different conflation procedures within the two basic approaches, non-linguistic and linguistic techniques, and to justify the application of normalization methods within the framework of linguistic techniques.
Design/methodology/approach – Presents a range of term conflation methods that can be used in information retrieval. The uniterm and multiterm variants can be considered equivalent units for the purposes of automatic indexing. Stemming algorithms, segmentation rules, association measures and clustering techniques are well-evaluated non-linguistic methods, and experiments with these techniques show a wide variety of results. Alternatively, lemmatisation and the use of syntactic pattern-matching, through equivalence relations represented in finite-state transducers (FST), are emerging methods for the recognition and standardization of terms.
Findings – The survey attempts to point out the positive and negative effects of the linguistic approach and its potential as a term conflation method.
Originality/value – Outlines the importance of FSTs for the normalization of term variants.
Keywords Information retrieval, Document management, Indexing, Variance reduction
Paper type Conceptual paper

Introduction
In many information retrieval systems (IRS), documents are indexed by uniterms. However, uniterms may be ambiguous, and therefore unable to discriminate only the pertinent information. One solution to this problem is to work with multiterms (multi-word terms or phrases), often obtained through statistical methods.
The traditional IRS approach is based on this type of automatic indexing technique for representing documentary contents (Salton, 1980, 1989; Croft et al., 1991; Frakes and Baeza-Yates, 1992). The concepts behind such terms can be manifested in different forms, known as linguistic variants. A variant is defined as a text occurrence that is conceptually related to an original term. In order to avoid the loss of relevant documents, an IRS recognizes and groups variants by means of so-called conflation methods, or term normalization methods. The process of conflation may involve linguistic techniques such as the segmentation of words and the elimination of affixes, or lexical searches through thesauri. The latter is concerned with the recognition of semantic variants. The grouping of morphological variants would increase average recall, while the identification and grouping of syntactic variants is decisive in increasing the accuracy of retrieval. A study of the problems involved in using linguistic variants in IRS can be found in Sparck Jones and Tait (1984).

Journal of Documentation, Vol. 61 No. 4, 2005, pp. 520-547. © Emerald Group Publishing Limited, 0022-0418. DOI 10.1108/00220410510607507
Received March 2004. Revised September 2004. Accepted January 2005.
The Emerald Research Register for this journal is available at www.emeraldinsight.com/researchregister
The current issue and full text archive of this journal is available at www.emeraldinsight.com/0022-0418.htm
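
As a rough illustration of the conflation idea described in the Introduction (grouping morphological variants of a term under one index key by eliminating affixes), the following minimal Python sketch may help. It is not any of the algorithms surveyed in this paper: the suffix list, the minimum stem length and the function names `naive_stem` and `conflate` are all hypothetical choices made only for this example.

```python
from collections import defaultdict

# Hypothetical, deliberately small suffix list, ordered longest first.
# This is an assumption for illustration, not a published stemming rule set.
SUFFIXES = ("ations", "ation", "ings", "ing", "ers", "er", "es", "ed", "s")

# Assumed lower bound on stem length, to avoid stripping very short words.
MIN_STEM = 3


def naive_stem(word):
    """Lowercase the word and strip the first matching suffix, if any."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: -len(suffix)]
    return word


def conflate(terms):
    """Group terms that share the same naive stem under one key."""
    groups = defaultdict(list)
    for term in terms:
        groups[naive_stem(term)].append(term)
    return dict(groups)


if __name__ == "__main__":
    variants = ["index", "indexes", "indexing", "indexed", "retrieval"]
    for stem, members in conflate(variants).items():
        print(stem, members)
```

Here the four variants of "index" fall into one group, so a query on any of them could match documents indexed under the others, which is the recall gain the text attributes to grouping morphological variants. A crude rule list like this one can also merge unrelated words or miss irregular forms.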