Perfect Encoding: a Signature Method for Text Retrieval

D. Dervos 1,2, P. Linardis 1, Y. Manolopoulos 1

1 Dept. of Informatics, Aristotle University, 540 06 Thessaloniki, Greece
e-mail: {ddervos,manolopo}@athena.auth.gr
2 Dept. of Informatics, Technology Educational Institute, 541 01 Thessaloniki, Greece

Abstract

A new methodology is introduced, where blocks of text are replaced by a compressed, fully reversible signature pattern. Full reversibility implies zero information loss, thus the new method is termed Perfect Encoding. The method's analytical model is produced and, where applicable, contrasted with the current practice in signature file organizations. The analysis indicates that the method is a potential candidate for information retrieval implementations. In particular, perfect encoding has the potential to develop into an alternative or complementary scheme to inverted or signature file based systems.

1 Introduction

Free text indexing methodologies, like the inverted file and signature file approaches, are widely applied in the modern Information Retrieval (IR) environment [6, 4]. The inverted file approach is characterized by its efficiency in text retrieval operations, whereas the Signature File (SF) involves a simple structure and requires significantly less storage overhead. The Superimposed Coding Signature File (SC-SF) is the most widely used signature file variant. SC-SF is applied in indexing objects for a variety of text-based applications [13, 7, 11].

SC-SF considers the textbase to consist of a number of logical blocks, each block containing a constant, pre-specified number (D) of distinct, non-common words. An F-bit signature or descriptor, consisting of m ones (1s) set in the [1...F] range, is associated with each word in the logical block. For each block of text, its D word signature patterns are bit-ORed to produce an F-bit block signature pattern.
Block signature patterns are used as an intermediary, compressed representation for text indexing purposes.

The SC-SF intermediary representation occupies nearly 10% of the storage occupied by the corresponding text base. This is a significant improvement over the inverted file environment, which calls for a storage overhead of 100% or larger [3, 10]. In this respect, SC-SF is a compromise between inverted file and full-text scanning methods. Its structure is highly modular and allows for efficient query processing in parallel architectures [8, 14, 9].

A desirable development would be to combine the speed and the exactness of the inverted file organization with the low storage overhead, modularity and simplicity of the signature file. The current study is an effort in the direction of improving the signature file organization by achieving a higher text compression rate as well as by eliminating the information loss involved.

Proceedings of the International Workshop on Advances in Databases and Information Systems (ADBIS’96). Moscow, September 10–13, 1996. Moscow: MEPhI, 1996.

Figure 1: SC-SF example with F = 12, D = 2, m = 4

Figure 1 shows a simplified (F = 12, D = 2, m = 4) configuration. A two-word (documents, collection) logical block of text is considered, together with its 12-bit block signature pattern. When four single-word queries (documents, collection, book and text) are processed against the intermediary binary representation of text, the latter is successful in predicting the presence of documents and collection: every 1 in each word signature pattern has the corresponding bit position in the block signature pattern set to 1, too. The scheme is also successful in predicting the absence of the word file: bit position number 10 is set to 1 in the word signature pattern but registers a 0 in the block signature.
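The SC-SF construction and the single-word query test described above may be sketched as follows. This is a minimal illustration under stated assumptions, not the configuration of Figure 1. The text does not prescribe how the m bit positions of a word signature are chosen, so the SHA-256 based selection below is an assumption made purely for the example, and the bit patterns it produces will differ from those in the figure. The function names (word_signature, block_signature, may_contain) are hypothetical.

```python
import hashlib

F = 12  # signature length in bits (value used in Figure 1)
M = 4   # number of 1s per word signature (value used in Figure 1)


def word_signature(word: str, f: int = F, m: int = M) -> int:
    """Return an f-bit integer with exactly m bits set to 1.

    Assumption: bit positions are derived by repeated SHA-256 hashing.
    The paper does not specify a hash function; any scheme that sets
    m distinct positions in an f-bit pattern would serve here.
    """
    if not 0 < m <= f:
        raise ValueError("m must satisfy 0 < m <= f")
    signature = 0
    counter = 0
    # Keep hashing until m distinct bit positions have been set.
    while bin(signature).count("1") < m:
        digest = hashlib.sha256(f"{word}:{counter}".encode("utf-8")).digest()
        position = int.from_bytes(digest[:4], "big") % f
        signature |= 1 << position
        counter += 1
    return signature


def block_signature(words, f: int = F, m: int = M) -> int:
    """Bit-OR the word signatures of a logical block into one pattern."""
    signature = 0
    for word in words:
        signature |= word_signature(word, f, m)
    return signature


def may_contain(block_sig: int, word: str, f: int = F, m: int = M) -> bool:
    """True if every 1 of the word signature is also 1 in the block signature."""
    word_sig = word_signature(word, f, m)
    return (block_sig & word_sig) == word_sig


if __name__ == "__main__":
    block = block_signature(["documents", "collection"])
    print(format(block, f"0{F}b"))
    for query in ("documents", "collection", "book", "file"):
        print(query, may_contain(block, query))
```

A False result from may_contain guarantees that the word is absent from the block, since any word in the block contributes all of its 1s to the block signature. A True result only says that the block signature covers the word signature.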
In the case of the single-word query book, the SC-SF configuration in Figure 1 is seen to introduce information loss. The word is erroneously taken to be present in the block, because every 1 in its signature coincides with a 1 set by one or the other of the two words present. Searching for the word book is said to result in a False Match or a False Drop, which is indicative of the information loss introduced by the SC-SF text compression stage. The rate at which false matches occur is reflected by the False Drop Probability (FDP) metric [4]: