Comparing Inverted Files and Signature Files for Searching a Large Lexicon

BEN CARTERETTE 1, FAZLI CAN 2

Computer Science and Systems Analysis Department
Miami University, Oxford, OH 45056

December 4, 2003

To appear in Information Processing and Management

Abstract

Signature files and inverted files are well-known index structures. In this paper we undertake a direct comparison of the two for searching for partially-specified queries in a large lexicon stored in main memory. Using n-grams to index lexicon terms, a bit-sliced signature file can be compressed to a smaller size than an inverted file if each n-gram sets only one bit in the term signature. With a signature width less than half the number of unique n-grams in the lexicon, the signature file method is about as fast as the inverted file method, and significantly smaller. Greater flexibility in memory usage and faster index generation time make signature files appropriate for searching large lexicons or other collections in an environment where memory is at a premium.

Keywords: Compression, Dictionaries, Indexing Methods, Personal Digital Assistants (PDAs), Performance Evaluation.

1. Introduction

Searching a large lexicon is a fundamental activity in information retrieval: the first step in resolving a query to a document collection index is finding query terms in a lexicon (Witten et al., 1999; Baeza-Yates & Ribeiro-Neto, 1999). Lexicons, being relatively small, can be stored in main memory and searched very fast, but it is worth considering the gains that indexing the lexicon separately might give. A lexicon index could allow for partially-specified query terms (e.g. terms with a wildcard character representing multiple unspecified characters) by pattern matching.
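As an illustration of the two index structures named above, the following is a minimal Python sketch, not the authors' implementation. It indexes a toy lexicon by n-grams, once as an inverted file and once as a bit-sliced signature file in which each n-gram sets exactly one bit, and resolves a wildcard pattern by filtering candidates and then pattern matching. The n-gram length of 2, the absence of word-boundary markers, the CRC32 hash, the signature width of 8, the toy lexicon, and all function names are assumptions made for the example. Compression of either index is not shown.

```python
import fnmatch
import zlib

N = 2  # n-gram length; an assumption for this sketch


def ngrams(s, n=N):
    """Set of contiguous n-grams of s (empty if s is shorter than n)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def pattern_ngrams(pattern, n=N):
    """n-grams taken from the fully specified pieces of a pattern like 'lab*r'."""
    grams = set()
    for piece in pattern.split("*"):
        grams |= ngrams(piece, n)
    return grams


def bit_for(gram, width):
    """Map an n-gram to a single signature bit (CRC32 is a hypothetical choice)."""
    return zlib.crc32(gram.encode("utf-8")) % width


def build_inverted(lexicon):
    """Inverted file: n-gram -> set of term numbers containing it."""
    index = {}
    for tid, term in enumerate(lexicon):
        for g in ngrams(term):
            index.setdefault(g, set()).add(tid)
    return index


def build_bit_slices(lexicon, width):
    """Bit-sliced signature file: slices[b] is an int whose bit t is set
    when the signature of term t has bit b set."""
    slices = [0] * width
    for tid, term in enumerate(lexicon):
        for g in ngrams(term):
            slices[bit_for(g, width)] |= 1 << tid
    return slices


def candidates_inverted(index, pattern, n_terms):
    """Intersect the posting sets of every n-gram in the pattern."""
    result = set(range(n_terms))
    for g in pattern_ngrams(pattern):
        result &= index.get(g, set())
    return result


def candidates_signature(slices, pattern, n_terms):
    """AND together the bit slices selected by the pattern's n-grams."""
    mask = (1 << n_terms) - 1
    for g in pattern_ngrams(pattern):
        mask &= slices[bit_for(g, len(slices))]
    return {t for t in range(n_terms) if (mask >> t) & 1}


def search(lexicon, candidates, pattern):
    """Pattern-match each candidate to discard false matches."""
    return sorted(lexicon[t] for t in candidates
                  if fnmatch.fnmatchcase(lexicon[t], pattern))


LEXICON = ["label", "labor", "labour", "lambda", "table", "harbor"]
inverted = build_inverted(LEXICON)
slices = build_bit_slices(LEXICON, width=8)

by_inverted = candidates_inverted(inverted, "lab*r", len(LEXICON))
by_signature = candidates_signature(slices, "lab*r", len(LEXICON))
print(search(LEXICON, by_inverted, "lab*r"))   # ['labor', 'labour']
print(search(LEXICON, by_signature, "lab*r"))  # ['labor', 'labour']
```

In this sketch both methods return a superset of the true matches, so the final pattern-matching pass is needed in either case. Because several n-grams can hash to the same bit, the signature file may pass extra candidates that the inverted file would not, which is the price of its fixed, tunable width.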
1 Now at the Center for Intelligent Information Retrieval, University of Massachusetts, Amherst, MA 01003. e-mail: carteret@cs.umass.edu.
2 Corresponding author, Computer Science and Systems Analysis Department, Miami University, Oxford, OH 45056, e-mail: canf@muohio.edu, voice: +1 (513) 529-5950, fax: +1 (513) 529-1524.

Partially-specified terms have uses in cross-language retrieval (Guthrie et al., 1996), spell checking (Kukich, 1992), approximate matching (Zobel & Dart, 1994), crossword puzzle generation (Harris et al., 1993; Harris et al., 1992), and library catalog retrieval (Crane, 1996), to name a few. Another application is query expansion: a preprocessing step could use a lexicon search to translate a pattern into a set of terms for a disjunctive search. Glimpse (Manber & Wu, 1993) uses approximate matching for file system search. Partial and approximate matching have