Analysis of Lossless Compression for a Large Class of Sources of Information - Part I

Valeriu Munteanu (1), Daniela Tarniceriu (1), Gheorghe Zaharia (2)
(1) Faculty of Electronics, Telecommunications and Information Technology, Technical University "Gheorghe Asachi" Iasi, Romania, e-mail: vmuntean@etc.tuiasi.ro
(2) IETR - INSA, UMR CNRS 6164 Rennes, France, e-mail: Gheorghe.Zaharia@insa-rennes.fr

Abstract - We analyze the lossless compression of a large class of discrete, complete and memoryless sources performed by a generalized Huffman encoding with an alphabet of M letters. Given the number of source messages, N, the alphabet size, M, and the number p of code words on each level of the encoding graph, except the last two, we determine the unknown encoding parameters: the number n of levels in the encoding graph, the number q of code words on level n-1, the number k of groups of M nodes, and the number m of remaining nodes on the last level. The average code word length is also computed. Two extreme cases, p = 0 and p = M-1, are analyzed.

I. INTRODUCTION

In recent years, storage and transmission capacities have grown considerably. However, due to practical demands, large files still have to be compressed before being stored or transmitted. Compression algorithms [1], [2], [3] represent an input signal X as another signal X_c using fewer bits than the original. The reconstruction (decompression) algorithm operates on X_c to produce an estimate \hat{X} of the original signal. In lossless compression it is required that \hat{X} = X [4]. This kind of compression is needed for signals such as text, computer data and several image types, for which exact reconstruction is crucial [5], [6], [7]. The most widely used lossless compression techniques are Huffman encoding, arithmetic coding and dictionary techniques [8], [9], [10]. In this paper the Huffman compression technique is analyzed for a large class of information sources.
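As background for the analysis that follows, the M-ary Huffman procedure referred to above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function names (`mary_huffman_lengths`, `average_length`) are ours, and the sketch pads the message set with zero-probability dummies so that every merge combines exactly M nodes, the standard device for M-ary Huffman coding.

```python
import heapq
import itertools

def mary_huffman_lengths(probs, M=2):
    """Return the Huffman code word length of each message for an
    M-letter code alphabet (a sketch; names are illustrative)."""
    n = len(probs)
    # Pad with zero-probability dummies so (N - 1) mod (M - 1) == 0,
    # ensuring every internal node of the code tree has exactly M children.
    pad = (-(n - 1)) % (M - 1)
    depth = [0] * (n + pad)
    counter = itertools.count()          # tie-breaker so tuples stay comparable
    heap = [(p, next(counter), [i])
            for i, p in enumerate(list(probs) + [0.0] * pad)]
    heapq.heapify(heap)
    while len(heap) > 1:
        merged_p, merged_syms = 0.0, []
        for _ in range(min(M, len(heap))):
            p, _, syms = heapq.heappop(heap)
            merged_p += p
            merged_syms += syms
        for s in merged_syms:
            depth[s] += 1                # every merged symbol moves one level deeper
        heapq.heappush(heap, (merged_p, next(counter), merged_syms))
    return depth[:n]                     # drop the dummy symbols

def average_length(probs, lengths):
    """Average code word length, the figure of merit computed in the paper."""
    return sum(p * l for p, l in zip(probs, lengths))
```

For example, the binary case `mary_huffman_lengths([0.5, 0.25, 0.125, 0.125], M=2)` yields the lengths `[1, 2, 3, 3]` with average length 1.75, matching the source entropy; the ternary case M = 3 produces a code tree of the kind analyzed in Section II.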
For each analyzed case, the encoding graph, the corresponding encoding parameters and the average code word length are derived.

II. ENCODING GRAPH AND CORRESPONDING PARAMETERS

We consider a source that can deliver N messages and whose distribution is such that a Huffman encoding with an alphabet of M letters (M >= 2) yields, in the general case, a tree graph as in Fig. 1. In this graph, p (0 <= p <= M-1) code words are placed on each level, except the last two. For the sake of generality we assume that p + q code words are placed on level n-1 and kM + m code words on level n (2 <= m <= M).

Figure 1. The encoding graph for 0 <= p <= M-1.

The following quantities are assumed known before the encoding process starts:
- N, the number of source messages to be encoded;
- M, the cardinality of the code alphabet;
- p, the number of code words on each level of the graph, except the last two.

The unknown parameters are:
- n, the number of levels in the encoding graph;
- q, the number of code words on level n-1;
- the parameters k and m.

To determine the number of levels n in the encoding graph, we take into account that the number of nodes on the first level is

N_1 = M.                                                        (1)

The number of nodes on the second level is

N_2 = (M - p)M = M^2 - pM.                                      (2)

The number of nodes on the third level is

N_3 = (M^2 - pM - p)M = M^3 - pM^2 - pM.                        (3)

Similarly, the number of nodes on level n-2 is

N_{n-2} = M^{n-2} - pM^{n-3} - ... - pM
        = M^{n-2} - pM (M^{n-3} - 1)/(M - 1),                   (4)

and for level n-1,