Category: High Performance Computing
Copyright © 2015, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
Techniques for Specialized Data Compression
INTRODUCTION
The continuing improvement in data storage technologies makes enormous storage capacities available to users. For instance, between 1996 and 2010, the average storage capacity of a desktop personal computer drive increased 375-fold (Adams, 2012). Even so, the growth of available storage capacity is still outpaced by the growth of the information produced worldwide, especially as a gigabyte of stored content can generate a much higher volume of transient data that is not typically stored, but is often transmitted (Gantz & Reinsel, 2011). Hence the need to conserve not only data storage space but also data transmission bandwidth. This need is answered by data compression.
Data compression is “the process of converting an input data stream into another data stream that has a smaller size” (Solomon, 2007). Data compression is possible thanks to the redundancy of data: it exploits the fact that some portions of the input stream need not be stored, as they can be recreated from the remaining parts of the stream, and/or the fact that some portions of the data are either not relevant to the user at all or of negligible relevance. Pursuing the latter option usually yields much higher compression ratios, but results in a loss of information qualified as irrelevant during compression, which may prove relevant in the future, for another use, or to another user. As there is extensive literature devoted to lossy data compression (see, e.g., Sayood, 2012, and references therein), this article describes only lossless methods.
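As a minimal illustration of the first kind of redundancy removal — omitting portions of the stream that can be recreated from what remains — consider run-length encoding, sketched below. (This is the author's simplest possible example, not a method discussed in this article; the function names are illustrative.)

```python
def rle_encode(data: str) -> list[tuple[str, int]]:
    """Collapse runs of repeated symbols into (symbol, count) pairs."""
    runs: list[tuple[str, int]] = []
    for symbol in data:
        if runs and runs[-1][0] == symbol:
            runs[-1] = (symbol, runs[-1][1] + 1)
        else:
            runs.append((symbol, 1))
    return runs

def rle_decode(runs: list[tuple[str, int]]) -> str:
    """Recreate the original stream exactly -- no information is lost."""
    return "".join(symbol * count for symbol, count in runs)

encoded = rle_encode("aaaabbbcca")
# The repeated symbols were not stored individually, yet the decoder
# recreates them exactly from the counts -- the compression is lossless.
assert rle_decode(encoded) == "aaaabbbcca"
```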
Lossless data compression methods can be classified into two types. General-purpose methods use a generic model that adapts to its input, and thus manage to compress various types of data. Specialized methods are designed to process only one type of data (defined more or less narrowly). Thus, they can not only start compression with a model prepared for data of that specific type, but also exploit redundancy specific to that type, which would be invisible to a general-purpose method.
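To make the distinction concrete, here is a hedged sketch of one type-specific transform (illustrative only; not a method named in this article): if the data are known to be a sorted sequence of integers, such as timestamps, storing successive differences produces many small values that a statistical coder can then encode cheaply — structure a general-purpose method, seeing only raw bytes, could not exploit directly.

```python
def delta_encode(sorted_values: list[int]) -> list[int]:
    # Assumes a non-empty, sorted list: keep the first value,
    # then store only the (typically small) gaps between neighbours.
    return [sorted_values[0]] + [
        b - a for a, b in zip(sorted_values, sorted_values[1:])
    ]

def delta_decode(deltas: list[int]) -> list[int]:
    # Rebuild the original values by accumulating the gaps.
    values = [deltas[0]]
    for gap in deltas[1:]:
        values.append(values[-1] + gap)
    return values

timestamps = [1700000000, 1700000003, 1700000003, 1700000010]
gaps = delta_encode(timestamps)  # [1700000000, 3, 0, 7]
assert delta_decode(gaps) == timestamps
```

The transform itself saves nothing; its point is to reshape the symbol statistics so that a subsequent statistical coder performs better — the kind of combination of component techniques this article reviews.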
Contemporary data compression methods typically combine a set of techniques to achieve superior compression ratios. The aim of this article is to review such component techniques, useful for specialized compression of various types of data. First, however, the Background section explains how the most popular general-purpose data compression methods work.
BACKGROUND
The base technique of data compression, included in most contemporary compression methods, is statistical coding. It uses the statistics of occurrence of respective symbols in the input stream to minimize the size of the output stream. The most widely used technique of this kind is Huffman coding, which assigns short codewords to frequent input symbols and long codewords to rare symbols (Huffman, 1952). Huffman coding is optimal in the sense that no other codeword assignment could produce a shorter output stream. Further improvement is still possible by assigning value ranges, instead of individual codewords, to input symbols. Such an approach is taken by arithmetic coding, where the entire input stream is encoded as a binary fraction representing its cumulative probability (Rissanen, 1976).
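A minimal sketch of Huffman codeword assignment, using Python's standard heapq to repeatedly merge the two least frequent subtrees (identifiers and structure are the present sketch's own, not taken from the cited sources):

```python
import heapq
from collections import Counter

def huffman_codes(stream: str) -> dict[str, str]:
    """Build a prefix code: frequent symbols receive short codewords."""
    freq = Counter(stream)
    # Each heap entry: (total frequency, tie-breaker, {symbol: codeword so far}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # least frequent subtree
        f2, _, right = heapq.heappop(heap)   # second least frequent
        # Prepend one bit distinguishing the two merged subtrees.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_codes("aaaabbc")
# 'a', the most frequent symbol, gets the shortest codeword.
assert len(codes["a"]) <= len(codes["b"]) <= len(codes["c"])
```

Note the limitation the article points out next: every codeword here is a whole number of bits, which is exactly what arithmetic coding improves upon by encoding the whole stream as a single binary fraction.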
Most real-world data types exhibit some form of correlation between symbols, which cannot be exploited by merely counting the occurrences of individual symbols in the input stream. There are
Jakub Swacha
University of Szczecin, Poland
DOI: 10.4018/978-1-4666-5888-2.ch351