Computers and the Humanities, Vol. 10, pp. 69-87, Pergamon Press, 1976. Printed in the U.S.A.

On the Role of Words and Phrases in Automatic Text Analysis

G. SALTON AND A. WONG

IN automatic information retrieval, the first and possibly most crucial operation consists in assigning to the stored documents and to incoming user queries appropriate identifiers capable of representing information content. The difficulty of the task is illustrated by the fact that in practically all existing operational retrieval systems this indexing operation is carried out manually by trained indexers or subject experts, rather than automatically. Fully automatic indexing methods in which a computer is used to generate and assign these content identifiers are restricted to special, mostly laboratory-type environments [1,2]. Quite a few observers claim, in fact, that such an indexing operation (that is, automatic reduction of written texts to individual "units of expression" for the representation of content) is inherently impossible, irrespective of the manner in which it is carried out, and that retrieval or text-processing systems based on such content identifiers can never operate satisfactorily. Bar-Hillel, for example, asserts that

It is a logical category mistake to assume that a word, or a phrase, contains information in the same sense in which a statement does. In spite of prima facie appeal, the information content of a statement is not the sum, or combination, of the information content of its constituent phrases.... [3]

This type of argument leads to the conclusion that index terms and phrases are not substitutes for more complete content identifications, and that term sets cannot therefore function as miniature or condensed documents. The latter assertion is reinforced by showing that the juxtaposition of terms in the language is in no way comparable to the intersection, or union, of the document sets identified by the corresponding terms.
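The distinction can be sketched with a small hypothetical collection: the Boolean intersection of the document sets indexed by two terms is not the same as the set of documents in which the terms occur as a phrase. (The three sample documents below are invented for illustration.)

```python
# Hypothetical toy collection illustrating that the documents containing
# both "fish" and "food" are not the documents about "fish food".
docs = {
    1: "tropical fish food pellets for aquarium feeding",
    2: "fish migrate upstream while birds search for food",
    3: "fast food restaurants now serve fried fish",
}

def matching(term):
    """Set of document ids whose text contains the given single term."""
    return {d for d, text in docs.items() if term in text.split()}

# Boolean intersection of the two single-term document sets
both = matching("fish") & matching("food")

# Documents in which "fish food" occurs as a contiguous phrase
phrase = {d for d, text in docs.items() if "fish food" in text}

print(sorted(both))    # all three documents mention both terms
print(sorted(phrase))  # only document 1 concerns "fish food"
```

All three documents fall in the intersection, yet only the first is about fish food; the phrase set is a proper subset of the term intersection, which is the gap Bar-Hillel's objection points to.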
Thus, the set of documents related to "fish food" is not identical with the set identified by both "fish" and "food," and neither is a "Newfoundland dog" characterizable by the term pair "Newfoundland" and "dog" [3].

To replace the reduction of written texts to a set of simple terms only, a full theory of language understanding appears to be needed which would account for the complete stated and implied content of the texts. Such a theory of language understanding should be capable of identifying not only an appropriate set of content indicators, but also two main types of relations between indicators:

a) the logical-semantic relations between text units, which are dependent on the world knowledge and on the social context within which a given area of discourse is placed;

b) the linguistic-semantic relations, which are dependent on the linguistic context and are derivable from a knowledge of the "deep" structure of the texts [4].

For determining the logical-semantic relations, an encyclopedia or semantic net is often suggested to identify the scope and extent of a given subject area, and the known relationships between the concepts included in the field. The linguistic-semantic relations, on the other hand, are obtainable by using a combined syntactic-semantic analysis to generate a detailed structure of the written texts. When the world (encyclopedic) knowledge is combined with the linguistic analysis, a text should then be representable as a series of "inference chains" representing the line of thought expressed in the text, including both the stated and the unstated assumptions and conditions [5].

Unfortunately, a large number of unresolved problems interfere at present with the utilization of language understanding systems. There is uncertainty

Gerard Salton is the chairman of the computer science department, Cornell University, and Anita Wong is a graduate student of computer science there.
This study was supported in part by the National Science Foundation under grant GJ 43505.