Sorting Out the Document Identifier Assignment Problem

Fabrizio Silvestri
Institute for Information Science and Technologies, ISTI - CNR, via Moruzzi, 1, 56126 Pisa, Italy
fabrizio.silvestri@isti.cnr.it

Abstract. The compression of Inverted File indexes in Web Search Engines has received a lot of attention in recent years. Compressing the index not only reduces space occupancy but also improves overall retrieval performance, since it allows a better exploitation of the memory hierarchy. In this paper we show empirically that, for collections of Web documents, the performance of compression algorithms can be enhanced by simply assigning identifiers to documents according to the lexicographical ordering of their URLs. We validate this assumption by comparing several assignment techniques and several compression algorithms on a fairly large collection of about six million documents. The results are very encouraging: we can improve the compression ratio by up to 40% using an algorithm that takes about ninety seconds to finish and uses only 100 MB of main memory.

1 Introduction

Indexes in Web Search Engines (WSEs) are usually represented using the popular Inverted File (IF) data structure [15]. Given a set of documents, an IF is composed of two distinct sets: the Lexicon and the Posting Lists. The Lexicon represents the set of terms that can be found within the whole document set. Each term of the Lexicon is associated with a Posting List containing information (the so-called postings) on all the documents containing that term. For example, the index entry <t_1; 5; 3, 4, 10, 20, 23> states that term t_1 (stored within the Lexicon) appears in five documents, namely 3, 4, 10, 20, and 23. The set containing all these lists is stored within the Posting Lists section.
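To make the Lexicon / Posting Lists split concrete, the following is a minimal sketch of an in-memory Inverted File. It is an illustration only, not the implementation used in this paper: the function names `build_inverted_file` and `index_entry`, the whitespace tokenization, and the toy document set are all assumptions introduced here.

```python
from collections import defaultdict


def build_inverted_file(docs):
    """Build a toy Inverted File from a dict {doc_id: text}.

    Returns (lexicon, posting_lists): the sorted list of terms, and a
    dict mapping each term to the sorted list of ids of the documents
    that contain it. Tokenization by whitespace is an assumption.
    """
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    lexicon = sorted(postings)
    posting_lists = {term: sorted(postings[term]) for term in lexicon}
    return lexicon, posting_lists


def index_entry(term, posting_lists):
    """Return the entry <term; document count; document ids> as a tuple."""
    doc_ids = posting_lists[term]
    return (term, len(doc_ids), doc_ids)


# Hypothetical documents, chosen so that "web" reproduces the entry
# <t_1; 5; 3, 4, 10, 20, 23> used as an example in the text.
docs = {
    3: "web search",
    4: "web index",
    7: "index compression",
    10: "the web",
    20: "web",
    23: "web crawl",
}
lexicon, posting_lists = build_inverted_file(docs)
entry = index_entry("web", posting_lists)
print(entry)  # ('web', 5, [3, 4, 10, 20, 23])
```

Keeping each posting list sorted by document identifier is what lets the lists be scanned sequentially, a property the compression techniques discussed next rely on.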
One of the main reasons why IFs (or one of their variations) are usually adopted in real-world WSEs is that they can be easily compressed to reduce memory occupancy. Compressing indexes in WSEs has also been shown to improve the efficiency of the retrieval process [2, 11, 14]. A reduction in space occupancy, in fact, usually corresponds to a better utilization of the memory hierarchy. The majority of the techniques adopted for compressing IFs are based on their d-gapped representation [15]. Posting lists are usually scanned sequentially. For this reason, it is possible to represent those lists by taking differences