Rapid lossless compression of short text messages
Kenan Kalajdzic a,⁎, Samaher Hussein Ali c, Ahmed Patel a,b
a School of Computer Science, Centre of Software Technology and Management (SOFTAM), Faculty of Information Science and Technology (FTSM), Universiti Kebangsaan Malaysia, UKM Bangi, 43600 Selangor Darul Ehsan, Malaysia
b School of Computing and Information Systems, Faculty of Science, Engineering and Computing, Kingston University, Penrhyn Road, Kingston upon Thames KT1 2EE, United Kingdom
c Department of Information Network, Faculty of Information Technology (IT), University of Babylon, Babylon 00964, Iraq
Article info
Article history:
Received 28 November 2012
Received in revised form 27 May 2014
Accepted 28 May 2014
Available online 6 June 2014
Keywords:
Data compression
Lossless compression
Short text messages
SMS
Abstract
In this paper we present a new algorithm called b64pack¹ for compression of very short text messages. The algorithm executes in two phases. In the first phase, it converts the input text, consisting of letters, numbers, spaces and punctuation marks commonly used in English writing, to a format which can be compressed in the second phase. The second phase consists of a transformation which reduces the size of the message by a fixed fraction of its original size. We experimentally measured both the compression speed and the compression ratio of b64pack on a large number of short messages and compared them with compress, gzip and bzip2, the three most common UNIX compression programs. We show that for short text messages up to a certain size, b64pack achieves better compression than any of the three programs. With respect to speed, b64pack beats all three programs by orders of magnitude. This rapid compression is one of the key strengths of b64pack.
© 2014 Elsevier B.V. All rights reserved.
1. Introduction
Until recent years, most algorithms for text compression were primarily concerned with compressing large inputs. The fast adoption of SMS messaging and of Internet services based on short messages (e.g. Twitter, chat) has increased interest in the compression of very short texts. Interestingly, though, publications concerning compression of short messages are relatively scarce.
Why is compression of short messages necessary? Given the high volume of SMS, Twitter and instant messaging traffic, compression of short text messages can bring tremendous savings in network bandwidth. Could multiple messages not first be buffered to form a larger chunk of data and then compressed with a regular compression algorithm to achieve better results? For real-time communication, such as instant messaging or chat, buffering of multiple messages is not possible, since each message has to be sent independently and immediately after it is typed. We therefore need a mechanism to compress each of these short messages individually.
In the case of SMS messages, a system called concatenated SMS has been developed to extend the inherent length limit of an SMS message. It works by breaking a long message into smaller parts and sending each of them as a single SMS message. At the receiving end the parts are combined back into one long message. One downside of concatenated SMS is that, if the length of a message exceeds 140 bytes, the user is usually charged for two SMS messages, even if the excess is only a few characters long.
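The billing effect described above can be illustrated with a small calculation. The following is a minimal sketch, not part of the original paper: the 140-byte limit comes from the text, while the 134-byte payload per concatenated part (140 bytes minus a 6-byte concatenation header) and the function name `sms_parts` are assumptions made for illustration.

```python
SINGLE_SMS_BYTES = 140   # payload limit of one SMS message, as stated in the text
CONCAT_PART_BYTES = 134  # assumption: 140 bytes minus a 6-byte concatenation header


def sms_parts(payload_bytes: int) -> int:
    """Return how many SMS messages are needed to carry payload_bytes."""
    if payload_bytes < 0:
        raise ValueError("payload size must be non-negative")
    if payload_bytes <= SINGLE_SMS_BYTES:
        return 1  # fits into a single, unconcatenated message
    # Ceiling division: every part carries at most CONCAT_PART_BYTES.
    return -(-payload_bytes // CONCAT_PART_BYTES)
```

Under these assumptions a 140-byte payload costs one message while a 141-byte payload costs two, which is why even a modest fixed reduction in size can halve the cost of a message that lies just over the limit.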
In this paper we introduce a new algorithm called b64pack for effi-
cient compression of very short text messages. In contrast with other
major works in short text compression, such as [1–3], which focus on
certain limitations of prediction by partial matching (PPM) compression
and provide ways to improve it, we follow a different approach.
To facilitate easy deployment and interoperability across billions of computers, mobile and embedded devices, we propose a compression scheme which relies on the straightforward use of standard open source software libraries available on all operating systems. The use of b64pack does not require any proprietary software components or algorithms. We compare b64pack with other standard compression algorithms implemented by programs such as compress, gzip and bzip2 to demonstrate how applications and users could directly benefit from using b64pack for compression of short messages. Our research objective was to show that b64pack is able to overcome certain major drawbacks of existing SMS services. We did not specifically set out or purport to evaluate b64pack against other data compression schemes, and have used them merely as a reference for comparison.
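To give a feel for how a two-phase scheme built on a standard library might look, here is a minimal sketch. It is not the authors' implementation: it assumes that the second phase is an ordinary Base64 decoding step (suggested only by the name b64pack), and it uses a deliberately narrow, hypothetical input alphabet of ASCII letters, digits and the space, with `/` reserved as a filler symbol. The names `pack`, `unpack`, `ALLOWED` and `PAD` are invented for this illustration.

```python
import base64
import string

# Hypothetical 63-symbol input alphabet: ASCII letters, digits and the space.
ALLOWED = set(string.ascii_letters + string.digits + " ")
PAD = "/"  # the 64th Base64 symbol, reserved here as filler (an assumption)


def pack(text: str) -> bytes:
    """Compress text by reinterpreting it as Base64 symbols and decoding."""
    if not set(text) <= ALLOWED:
        raise ValueError("unsupported character in input")
    # Phase 1: map every input character to a symbol of the Base64 alphabet.
    # Letters and digits map to themselves; the space maps to '+'.
    symbols = text.replace(" ", "+")
    # Fill up to a multiple of four symbols so the decoder sees whole groups.
    symbols += PAD * (-len(symbols) % 4)
    # Phase 2: Base64 decoding packs four 6-bit symbols into three bytes.
    return base64.b64decode(symbols)


def unpack(data: bytes) -> str:
    """Invert pack(): Base64-encode, drop the filler, restore the spaces."""
    symbols = base64.b64encode(data).decode("ascii")
    return symbols.rstrip(PAD).replace("+", " ")
```

In this sketch the 11-character message "hello world" packs into 9 bytes and unpacks to the same text. Because the output is always a multiple of three bytes, inputs shorter than four characters do not shrink, and the fixed three-quarters ratio is a property of this sketch rather than a figure taken from the paper.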
The key features of b64pack are:
• extremely low memory requirements: a message compressed with b64pack requires no header or metadata, and in the base case the lookup tables used by b64pack together occupy less than 256 bytes of memory;
Computer Standards & Interfaces 37 (2015) 53–59
⁎ Corresponding author.
E-mail addresses: kenan@unix.ba (K. Kalajdzic), samaher@itnet.uobabylon.edu.iq (S.H. Ali), whinchat2010@gmail.com (A. Patel).
¹ b64 stands for BASE64.
http://dx.doi.org/10.1016/j.csi.2014.05.005
0920-5489/© 2014 Elsevier B.V. All rights reserved.