Information Processing & Management Vol. 24, No. 1, pp. 17-22, 1988
Printed in Great Britain. 0306-4573/88 $3.00 + .00 © 1988 Pergamon Journals Ltd.

AN IMPROVED ALGORITHM FOR THE CALCULATION OF EXACT TERM DISCRIMINATION VALUES

ABDELMOULA EL-HAMDOUCHI and PETER WILLETT*
Department of Information Studies, University of Sheffield, Western Bank, Sheffield S10 2TN, UK

(Received 24 March 1987; accepted 8 May 1987)

Abstract: The term discrimination model provides a means of evaluating indexing terms in automatic document retrieval systems. This article describes an efficient algorithm for the calculation of term discrimination values that may be used when the interdocument similarity measure used is the cosine coefficient and when the document representatives have been weighted using one particular term-weighting scheme. The algorithm has an expected running time proportional to Nn^2 for a collection of N documents, each of which has been assigned an average of n terms.

1. INTRODUCTION

The term discrimination model [1, 2] has been suggested as a basis for the evaluation of indexing terms in computerized document retrieval systems. For some term i, the term discrimination value, DV_i, measures the extent to which the use of i as an indexing term affects the separation of the documents in the multidimensional space defined by the indexing vocabulary. Several studies have demonstrated a strong relationship between term discrimination and term frequency, with the most highly discriminating terms being those of intermediate frequencies of occurrence in document collections.

We are currently reevaluating the use of the term discrimination model as a basis for automatic indexing strategies. One obvious current limitation of the model is that the calculation of the DV_i values involves extensive computation.
For a collection of N documents indexed by a total of M terms, the obvious algorithm [3] for the computation of all M DV_i values involves the calculation of O(N^2 M) interdocument similarity coefficients, and most studies of term discrimination have accordingly used an approximate method for the calculation of the DV_i values.

Willett [3] has recently described an algorithm for the calculation of exact discrimination values that involves the calculation of only O(N^2 n) similarity coefficients, where n is the mean number of indexing terms assigned to each of the documents in a file. This article reports a new algorithm for the calculation of exact term discrimination values that has an expected running time of order O(Nn^2).

*To whom all correspondence should be addressed.

2. CALCULATION OF TERM DISCRIMINATION VALUES

A collection of N documents is assumed to be represented by a series of document vectors D_j, 1 ≤ j ≤ N. Each such document vector contains M elements, where M is the number of distinct terms that have been used for the indexing of the collection: the ith element, d_ji, 1 ≤ i ≤ M, contains the number of occurrences of the ith term in the jth document. In many cases, including all of the seven document collections considered here, the indexing is binary in character so that the d_ji values are either 0 or 1. Thus, the collection may be visualized as an N × M bit matrix where the jth row represents the occurrence of terms in the jth document, and the ith column the occurrences of the ith term in the documents.

A measure of the similarity between some pair of documents D_j and D_k may then be calculated using a coefficient such as the cosine coefficient COS_jk defined by