A MODULAR APPROACH TO DOCUMENT INDEXING AND SEMANTIC SEARCH

Dhanya Ravishankar, Krishnaprasad Thirunarayan, and Trivikram Immaneni
Department of Computer Science and Engineering
Wright State University, Dayton, OH 45435
dhanyars@yahoo.com, t.k.prasad@wright.edu, timmanen@cs.wright.edu
http://www.cs.wright.edu/~tkprasad

ABSTRACT
This paper develops a modular approach to improving the effectiveness of searching documents for information by reusing and integrating mature software components such as the Lucene APIs, WordNet, LSA techniques, and domain-specific controlled vocabularies. To evaluate its practical benefits, the prototype was used to query the MEDLINE database and to locate domain-specific controlled vocabulary terms in Materials and Process Specifications. Its extensibility has been demonstrated by incorporating a spell-checker for the input query, and by structuring the retrieved output into hierarchical collections for quicker assimilation. It is also being used to experimentally explore the relationship between LSA and document clustering using the 20-mini-newsgroups and Reuters data. In the future, this prototype will be used as an experimental testbed for expressive, context-aware, and scalable searches.

KEY WORDS
Search and Querying, Tools, Latent Semantic Indexing, Domain-Specific Search, Modular Search Engine, Document Clustering

1. Introduction

State-of-the-art search engines (such as Google) provide a scalable solution for flexible and efficient search of Web documents, capturing collective Web "wisdom" to rank-order the retrieved documents. This approach to search cannot always be expected to work well for documents pertaining to specialized domains, where implicit background knowledge and vocabulary can be exploited to improve the accuracy of the retrieved results [1]. In general, precision can be improved through disambiguation, and recall can be improved by considering meaning-preserving query variations [2][3][4].
Verbatim searches can be generalized in a number of directions, such as by using information implicit in the English language and in the document collection. Eliminating stop words and affixes, performing proximity-based searches, etc., can capture semantic invariance due to word inflection and permutation, improving recall. English-language synonyms can also be used to improve recall, but including synonyms for all possible senses can adversely affect precision. The Latent Semantic Analysis (LSA) approach effectively regroups the document collection on the basis of occurrences of correlated words inferred from the collection, so that some documents that lack the query words may be retrieved, and other documents that happen to contain the query words in a different context may be skipped [5][6].

In this paper, we investigate the systematic generalization of keyword-based syntactic queries to concept-based semantic queries by utilizing linguistic information (such as synonyms) available explicitly, and domain-specific information (such as term correlations or associations) available implicitly in the document collections and explicitly through controlled vocabularies. Furthermore, it is important to locate and highlight the query hits in the context of a document in order to enable access to its relevant portions (because the user may not be aware of the automatically included context). To ensure that the search tool is efficient, flexible, and usable in practice, and extensible, customizable, and evolvable in the future, mature software components have been employed in developing the infrastructure.

The content indexing and intelligent search tool discussed above has been put to novel use in performing domain-specific information extraction from documents (for example, Materials and Process Specifications), by exploiting it for semi-automatic mapping of document phrases to controlled vocabulary terms.
That is, one can (i) determine all controlled vocabulary terms that can (partially) match a query phrase, (ii) determine all controlled vocabulary terms that appear in a document and locate the corresponding document phrases, and (iii) determine all partially matching controlled vocabulary terms that can potentially be extended to match a document phrase, deserving further human intervention for disambiguation. To demonstrate the extensibility of the tool for improving the user experience with respect to query input and display of query response, a spell-checker module and a simple technique for organizing search results into finer groups, respectively, have been incorporated. It is also being used to
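The three matching modes enumerated above can be illustrated with a deliberately simplified token-based matcher. The vocabulary terms, document, and function names below are hypothetical, chosen only for illustration; the actual tool performs this matching over Lucene indexes rather than raw string scans.

```python
# Hypothetical controlled vocabulary (illustration only; not taken from
# any real Materials and Process Specification).
vocabulary = ["heat treatment", "heat treatment furnace", "surface finish"]

def tokens(text):
    return text.lower().split()

def partial_matches(phrase):
    """(i) Vocabulary terms sharing at least one token with a query phrase."""
    p = set(tokens(phrase))
    return [t for t in vocabulary if p & set(tokens(t))]

def full_matches(document):
    """(ii) Vocabulary terms occurring verbatim in a document."""
    d = document.lower()
    return [t for t in vocabulary if t in d]

def extendable_matches(document, phrase):
    """(iii) Terms extending a matched phrase but not (yet) found verbatim
    in the document -- candidates for human disambiguation."""
    d = document.lower()
    return [t for t in vocabulary if t.startswith(phrase.lower()) and t not in d]

spec = "Apply heat treatment per spec."
hits = full_matches(spec)                           # ["heat treatment"]
flagged = extendable_matches(spec, "heat treatment")  # ["heat treatment furnace"]
```

A production matcher would additionally apply the same stemming and stop-word normalization used at indexing time, so that, for example, "heat treated" could still match "heat treatment".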
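The LSA-based regrouping described in the introduction can likewise be sketched in a few lines. This is a minimal illustration using scikit-learn on a three-document toy corpus (both are assumptions for exposition; the prototype itself integrates LSA with Lucene indexes over MEDLINE and specification documents). The point it demonstrates is the one made above: a document that lacks the query word can still be retrieved because correlated terms are folded into shared latent dimensions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy corpus (hypothetical). Documents 0 and 1 share the word "driven",
# which correlates "car"/"road" with "truck"/"highway".
docs = [
    "the car is driven on the road",
    "the truck is driven on the highway",
    "a bird sings in the tree",
]

# Term-document matrix with English stop words removed -- one of the
# verbatim-search generalizations discussed above.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Rank-2 latent space: correlated terms collapse into common dimensions.
svd = TruncatedSVD(n_components=2, random_state=0)
doc_vecs = svd.fit_transform(X)

def lsa_similarities(query):
    """Cosine similarity of the query to each document in latent space."""
    q = svd.transform(vectorizer.transform([query])).ravel()
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-12
    return doc_vecs @ q / norms

sims = lsa_similarities("car")
# Document 1 never mentions "car", yet it scores well above the unrelated
# document 2, because LSA groups it with document 0 via "driven".
```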