Rapid lossless compression of short text messages

Kenan Kalajdzic a,*, Samaher Hussein Ali c, Ahmed Patel a,b

a School of Computer Science, Centre of Software Technology and Management (SOFTAM), Faculty of Information Science and Technology (FTSM), Universiti Kebangsaan Malaysia, UKM Bangi, 43600 Selangor Darul Ehsan, Malaysia
b School of Computing and Information Systems, Faculty of Science, Engineering and Computing, Kingston University, Penrhyn Road, Kingston upon Thames KT1 2EE, United Kingdom
c Department of Information Network, Faculty of Information Technology (IT), University of Babylon, Babylon 00964, Iraq

Article info
Article history: Received 28 November 2012; received in revised form 27 May 2014; accepted 28 May 2014; available online 6 June 2014
Keywords: Data compression; Lossless compression; Short text messages; SMS

Abstract
In this paper we present a new algorithm called b64pack¹ for compression of very short text messages. The algorithm executes in two phases. In the first phase, it converts the input text, consisting of letters, numbers, spaces and punctuation marks commonly used in written English, to a format which can be compressed in the second phase. The second phase consists of a transformation which reduces the size of the message by a fixed fraction of its original size. We experimentally measured both the compression speed and the compression ratio of b64pack on a large number of short messages and compared them with compress, gzip and bzip2, the three most common UNIX compression programs. We show that, for short text messages up to a certain size, b64pack achieves better compression than any of the three programs. With respect to speed, b64pack beats all three programs by orders of magnitude. This rapid compression is one of the key strengths of b64pack.
© 2014 Elsevier B.V. All rights reserved.

1. Introduction

Until recent years, most algorithms for text compression were primarily concerned with compressing large inputs.
The fast adoption of SMS messaging and of Internet services based on short messages (e.g. Twitter, chat) has increased interest in the compression of very short texts. Interestingly, though, publications concerning compression of short messages are relatively scarce.

Why is compression of short messages necessary? Given the high volume of SMS, Twitter and instant messaging traffic, compression of short text messages can bring tremendous savings in network bandwidth. Could multiple messages not first be buffered to form a larger chunk of data and then compressed with a regular compression algorithm to achieve better results? For real-time communication, such as instant messaging or chat, the answer is no: buffering of multiple messages is not possible, since each message has to be sent independently and immediately after it is typed. Therefore we need a mechanism to compress each of these short messages individually.

In the case of SMS messages, a system called concatenated SMS has been developed to extend the inherent length limit of an SMS message. It works by breaking a long message into smaller parts and sending each of them as a single SMS message. At the receiving end, the short messages are combined back into one long message. One downside of concatenated SMS is that, if the length of an SMS message exceeds 140 bytes, the user is usually charged for two SMS messages, even if the excess is only a few characters long.

In this paper we introduce a new algorithm called b64pack for efficient compression of very short text messages. In contrast with other major works in short text compression, such as [1–3], which focus on certain limitations of prediction by partial matching (PPM) compression and provide ways to improve it, we follow a different approach.
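As a rough illustration of the charging issue described above for concatenated SMS, the following sketch counts how many messages would be billed for a given payload and how a fixed-fraction size reduction could change that count. It assumes a flat 140-byte limit per message (ignoring any per-part concatenation header) and a hypothetical size ratio of 3/4, suggested by the fact that BASE64 maps 4 characters to 3 bytes. The function names and the ratio are illustrative assumptions and are not taken from this section.

```python
SMS_LIMIT_BYTES = 140  # per-message limit quoted in the text


def sms_parts(length_bytes: int, limit: int = SMS_LIMIT_BYTES) -> int:
    """Messages billed for a payload of the given size.

    Simplification (assumption): every part carries a full `limit` bytes,
    with no per-part concatenation header.
    """
    if length_bytes < 0:
        raise ValueError("length must be non-negative")
    return (length_bytes + limit - 1) // limit  # integer ceiling division


def reduced_length(length_bytes: int, num: int = 3, den: int = 4) -> int:
    """Size after a fixed-fraction reduction, rounded up to whole bytes.

    The 3/4 ratio is a hypothetical value chosen for illustration.
    """
    return (length_bytes * num + den - 1) // den


if __name__ == "__main__":
    for n in (140, 141, 150, 186, 187):
        packed = reduced_length(n)
        print(n, sms_parts(n), packed, sms_parts(packed))
```

Under these assumptions, any message of up to 186 bytes would shrink to at most 140 bytes and be billed as a single SMS, whereas without the reduction anything over 140 bytes is billed as two.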
To facilitate easy deployment and interoperability across billions of computers and mobile and embedded devices, we propose a compression scheme which relies on a straightforward use of standard open source software libraries available on all operating systems. The use of b64pack does not require any proprietary software components or algorithms. We compare b64pack with other standard compression algorithms, implemented by programs such as compress, gzip and bzip2, to demonstrate how applications and users could directly benefit from using b64pack for compression of short messages.

Our research objective was to show that b64pack is able to overcome certain major drawbacks of existing SMS services. We did not specifically set out, or purport, to evaluate b64pack against other data compression schemes, and have used them merely as a reference for comparison.

Computer Standards & Interfaces 37 (2015) 53–59
http://dx.doi.org/10.1016/j.csi.2014.05.005
ISSN 0920-5489
* Corresponding author. E-mail addresses: kenan@unix.ba (K. Kalajdzic), samaher@itnet.uobabylon.edu.iq (S.H. Ali), whinchat2010@gmail.com (A. Patel).
¹ b64 stands for BASE64.

The key features of b64pack are:
- extremely low memory requirements: a message compressed with b64pack requires no header/metadata, while in the base case the lookup tables used by b64pack together occupy less than 256 bytes of memory;