An Intelligent System for Document Retrieval in Distributed Office Environments*

Uttam Mukhopadhyay,† Larry M. Stephens, Michael N. Huhns,‡ and Ronald D. Bonnell
Center for Machine Intelligence, University of South Carolina, Columbia, SC 29208

MINDS (Multiple Intelligent Node Document Servers) is a distributed system of knowledge-based query engines for efficiently retrieving multimedia documents in an office environment of distributed workstations. By learning document distribution patterns, as well as user interests and preferences during system usage, it customizes document retrievals for each user. A two-layer learning system has been implemented for MINDS. The knowledge base used by the query engine is learned at the lower level with the help of heuristics for assigning credit and recommending adjustments; these heuristics are incrementally refined at the upper level.

1. Introduction

Documents are used in computerized office environments to store a variety of information. This information is often difficult to utilize, especially in large offices with distributed workstations, because users do not have perfect knowledge of the documents in the system or of the organization for their storage. The goal of the MINDS project is to develop a distributed system of intelligent servers that (1) learn dynamically about document storage patterns throughout the system, and (2) learn interests and preferences of users so that searches are efficient and produce relevant documents [1,2]. The strategy adopted for evaluating a set of learning heuristics that are applicable to this goal is presented. In particular, this paper describes the heuristic evaluation testbed, distance measures for metaknowledge, document migration heuristics, evidence assimilation techniques, and results of a system simulation.

*This research was supported in part by NCR Corporation.
†New address: Computer Science Department, General Motors Research Laboratories,
Warren, MI 48090.
‡New address: Artificial Intelligence Department, Microelectronics and Computer Technology Corporation, 9430 Research Boulevard, Austin, TX 78759.

Received June 17, 1985; accepted August 30, 1985.
© 1986 by John Wiley & Sons, Inc.

2. Distributed Workstation Environment

A. Organization of Documents

Queries regarding documents are frequently based on the contents of the documents. Automatic text-understanding systems could conceivably process these queries by reading the documents, but would be expensive to develop and use. The names of documents provide clues to their contents, but names are not descriptive enough for reliable processing of content-based queries. However, a set of keywords may be used to describe document contents: the retrieval of documents can then be predicated on these keywords as well as on other document attributes, such as author, creation date, and location. Complex qualifiers, which are conjunctions or disjunctions of predicates on these attributes, may also be used. Each document is thus represented by a surrogate containing its attributes. The document and its surrogate are subsequently updated or deleted as dictated by system usage. Surrogates occupy only a fraction of the storage space required by the documents, but usually contain enough information for users to determine whether a document is useful.

The presumed office environment consists of a network of single-user workstations. Each user may query the system about his own locally stored documents or about those stored at other workstations. These documents are not permanently located but may migrate to other workstations. Multiple copies of documents are allowed, but documents stored at one location must have unique names.

B. The User's Perspective

In typical distributed document management systems, document directories are either centralized or distributed, with or without redundancy [3].
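The surrogate representation and the complex qualifiers described in Section 2.A can be illustrated with a minimal Python sketch. It is not the MINDS implementation: the `Surrogate` class, its field names, the helper functions (`has_keyword`, `by_author`, `created_after`, `all_of`, `any_of`, `retrieve`), and the sample records are all hypothetical, chosen only to show predicates on document attributes being combined by conjunction and disjunction.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, FrozenSet, Iterable, List


@dataclass(frozen=True)
class Surrogate:
    """Hypothetical surrogate: the document's attributes, not its body."""
    name: str
    author: str
    created: date
    location: str              # workstation currently holding the document
    keywords: FrozenSet[str]


# A predicate is any test on a single surrogate.
Predicate = Callable[[Surrogate], bool]


def has_keyword(word: str) -> Predicate:
    return lambda s: word in s.keywords


def by_author(author: str) -> Predicate:
    return lambda s: s.author == author


def created_after(day: date) -> Predicate:
    return lambda s: s.created > day


def all_of(*preds: Predicate) -> Predicate:
    """Conjunction: every predicate must hold."""
    return lambda s: all(p(s) for p in preds)


def any_of(*preds: Predicate) -> Predicate:
    """Disjunction: at least one predicate must hold."""
    return lambda s: any(p(s) for p in preds)


def retrieve(surrogates: Iterable[Surrogate], qualifier: Predicate) -> List[Surrogate]:
    """Return the surrogates that satisfy the qualifier, in input order."""
    return [s for s in surrogates if qualifier(s)]


# Illustrative data only; names, authors, and workstations are invented.
surrogates = [
    Surrogate("budget-1985", "alice", date(1985, 3, 1), "ws1",
              frozenset({"budget", "finance"})),
    Surrogate("minds-design", "bob", date(1985, 6, 10), "ws2",
              frozenset({"retrieval", "learning"})),
    Surrogate("trip-report", "bob", date(1984, 11, 5), "ws2",
              frozenset({"travel"})),
]

# Complex qualifier: author is bob AND (keyword "retrieval" OR created in 1985 or later).
qualifier = all_of(
    by_author("bob"),
    any_of(has_keyword("retrieval"), created_after(date(1984, 12, 31))),
)
matches = [s.name for s in retrieve(surrogates, qualifier)]
print(matches)
```

Representing each predicate as a plain function keeps qualifiers composable: a conjunction or disjunction is itself a predicate, so qualifiers can be nested to any depth without a separate query language.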
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE. 37(3): 123-135, 1986 CCC 0002-8231/86/030123-13$04.00
However, the directory