Compression of Persian Text for Web-Based Applications, Without Explicit Decompression

FATTANE TAGHIYAREH, EHSAN DARRUDI
Department of ECE Engineering
University of Tehran
P.O. Box 14395-515, North Kargar Ave, Tehran
IRAN

FARHAD OROUMCHIAN
University of Wollongong in Dubai
P.O. Box 20183, Dubai
UAE

NEEYAZ ANGOSHTARI
Computer Science & Engineering
Michigan State University
East Lansing, MI 48824-1226
USA

Abstract: The increasing importance of Unicode for text files implies a possible doubling of data storage space and data transmission time, with a corresponding need for data compression. The approach presented in this paper aims to reduce the storage and transmission time for Farsi text files in web-based applications and on the Internet. The basic idea is to find the most repetitive n-grams in Farsi text and replace each with a single character. These new characters are placed in the user-defined sections of Unicode, and proper glyphs are created for them. In this approach, compression is performed once on the server side, and the explicit decompression step is eliminated entirely: the rendering process in the browser effectively performs the decompression. No additional program or add-in needs to be installed on the browser or client side; the user only needs to download the proper Unicode font once. A genetic algorithm is utilized to select the most appropriate n-grams of different sizes in order to maximize the amount of compression. In the best case, we have achieved a 52.26% reduction in file size. The method is general and applies equally well to English and other languages.

Key-Words: N-gram Compression, Text Compression, Farsi, Unicode, Font, Genetic Algorithm

1 Introduction

With the increase in the amount of non-English or non-Latin text on the Internet, there is a growing need for compression algorithms that can exploit the characteristics of these languages.
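The replacement scheme outlined in the abstract can be sketched in a few lines. The sketch below is a minimal illustration only: it uses a simple greedy frequency count over a single n-gram size rather than the genetic selection over mixed sizes used in this paper, and all names are hypothetical. Each of the k most frequent n-grams is mapped to one code point in the Unicode Private Use Area (U+E000 onward), where a font glyph rendering the full n-gram would later be defined.

```python
from collections import Counter

PUA_START = 0xE000  # first code point of the Unicode Private Use Area (BMP)

def top_ngrams(text, n, k):
    """Return the k most frequent n-grams of size n in the text."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [gram for gram, _ in counts.most_common(k)]

def compress(text, n=2, k=16):
    """Replace the k most frequent n-grams with single PUA characters.

    Returns the compressed text and the mapping from each PUA character
    to its n-gram; the mapping is what the custom font would encode as
    glyphs, so the browser's rendering restores the visible text.
    """
    mapping = {}
    for i, gram in enumerate(top_ngrams(text, n, k)):
        pua_char = chr(PUA_START + i)
        mapping[pua_char] = gram
        text = text.replace(gram, pua_char)
    return text, mapping
```

Because each PUA character stands for a plain-text n-gram, substituting the n-grams back for the PUA characters recovers the original text exactly; in the paper's setting this substitution never happens in software, since the font's glyphs display each PUA character as its n-gram.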
Most of the algorithms that have been developed for text compression are adaptive dictionary-based compression algorithms [1] [2] [3]. They use previously encountered words to build one or more dictionaries [4] and output a series of tokens as the text is processed. These tokens are used in place of the words when they are encountered, reducing the storage space, but they require a considerable amount of computing resources at the decompression end.

Unicode [5] is an encoding system that provides a unique number for every character in all the major writing systems in use today. The original goal was to use a single 16-bit encoding that provides code points for more than 65,000 characters. The majority of common-use characters fit into the first 64K code points, an area of the code space called the Basic Multilingual Plane, or BMP for short. There are about 6,700 unused code points for future expansion in the BMP, plus over 870,000 unused supplementary code points on the other planes. So many characters are