IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 5, NO. 4, DECEMBER 2010 837
Identifying and Resolving Hidden Text Salting
Marie-Francine Moens, Jan De Beer, Erik Boiy, and Juan Carlos Gomez
Abstract—Hidden salting in digital media involves the inten-
tional addition or distortion of content patterns with the purpose
of evading content filtering. We propose a method to detect portions of
a digital text source which are invisible to the end user, when
they are rendered on a visual medium (like a computer monitor).
The method consists of “tapping” into the rendering process and
analyzing the rendering commands to identify portions of the
source text (plaintext) which will be invisible for a human reader,
using criteria based on text character and background colors, font
size, overlapping characters, etc. Moreover, text deemed visible
(covertext) is reconstructed from rendering commands and then
the character reading order is identified, which could differ from
the rendering order. The detection and resolution of hidden salting
is evaluated on two e-mail corpora, and the effectiveness of this
method in a spam filtering task is assessed. We provide a solution to
a relevant open problem in content filtering applications, namely
the presence of tricks aimed at circumventing automatic filters.
Index Terms—Content filtering, content manipulation.
I. INTRODUCTION
SALTING is the intentional addition or distortion of content
patterns in a digital source in order to evade automated content
analysis and filtering. We distinguish between surface salting (e.g.,
images containing random, anomalous pixel dots) and hidden salting
(e.g., text displayed with invisible ink), depending on whether or not
the salting is visually perceivable by the user of the content. In this paper,
we consider only hidden salting of textual data, where text is encoded
in computer-readable formats (e.g., ASCII, Unicode, HTML). By text,
we mean any sequence or constellation of characters in a particular
writing system, meant for human interpretation in a communicative
act. Hidden salting is most dangerous in the context of fraudulent
schemes and is found, for instance, in phishing e-mails that aim to
steal personal information, which can be used to commit identity
theft [1]–[3].
Salting in digital content is a phenomenon that has only recently
drawn scientific attention. However, given the increasing usage
and importance of automated content filtering, e.g., on interconnected
networks including the Internet, it can be expected that salting
will become more widespread and sophisticated.
Manuscript received December 23, 2009; revised June 22, 2010; accepted
July 09, 2010. Date of publication August 03, 2010; date of current version
November 17, 2010. The associate editor coordinating the review of this
manuscript and approving it for publication was Prof. Ton Kalker.
M.-F. Moens and E. Boiy are with the Department of Computer Science,
Katholieke Universiteit Leuven, Heverlee B-3001, Belgium (e-mail:
sien.moens@cs.kuleuven.be; erik.boiy@cs.kuleuven.be).
J. De Beer was with Katholieke Universiteit Leuven, Heverlee B-3001,
Belgium. He is now with IBM, Brussels B-1130, Belgium.
J. C. Gomez was with Katholieke Universiteit Leuven, Heverlee B-3001,
Belgium. He is now with ITESM, Monterrey 64710, Mexico (e-mail:
juancarlos.gomez@invitados.itesm.mx).
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIFS.2010.2063024
This paper proposes a method for detecting hidden or dis-
torted content in a digital text source that is not perceived by the
human end user when the plaintext (i.e., the literal, original, ma-
chine-readable version, e.g., an HTML source) of the digital text
is rendered through text (re)production processes on output de-
vices (e.g., the user’s monitor screen) (see Fig. 1). The method
generates the covertext of the digital text, which is defined as the
truly perceived content of the source. The paper gives evidence
that hidden salting can be correctly detected and resolved by
looking not only at the source text, but also at what a user per-
ceives. The proposed method consists of tapping into the ren-
dering process of the source text and analyzing the rendering
commands to identify portions of the text which are invisible
for a human reader, using criteria based on text character and
background colors, font size, overlapping characters, etc. More-
over, the text deemed visible (i.e., the covertext) is reconstructed
from the visible characters by identifying the character reading
order, which could differ from the rendering order stated in the
source. This cognitive model of the text’s reading order relies on
several language models (i.e., statistical models of text written
in a certain language) obtained from a large collection of texts.
Reconstructing the reading order of characters often requires
segmenting the text into blocks with a uniform reading order.
This segmentation is formulated as a greedy search over the
possible segmentations, but it can still be computationally
expensive for complex messages that contain many tables and
frames. Nevertheless, the method exposes the character sequences
in the source that have been manipulated. This is valuable because
automated filtering replaces expensive manual analyses, for
instance when inspecting e-mails and building rules for spam
filtering.
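As an illustration only, the two steps described above (discarding
invisible characters, then choosing a reading order with the help of a
language model) can be sketched in a few lines of Python. This is a
minimal sketch under stated assumptions, not the implementation
evaluated in this paper. The Glyph record, the two thresholds, and the
toy character-bigram model are hypothetical. Only two of the visibility
criteria (font size and color contrast) are checked, and the greedy
block segmentation is replaced by a choice between two candidate orders
(row by row and column by column).

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple

RGB = Tuple[int, int, int]


@dataclass(frozen=True)
class Glyph:
    """One hypothetical 'draw character' rendering command."""
    char: str
    x: float          # horizontal position on the canvas
    y: float          # vertical position on the canvas
    font_size: float  # in points
    color: RGB        # text color
    background: RGB   # color directly behind the glyph


MIN_FONT_SIZE = 4.0      # assumed threshold, not taken from the paper
MIN_COLOR_DISTANCE = 40  # assumed threshold, summed over RGB channels


def is_visible(g: Glyph) -> bool:
    """Check two of the criteria named in the text: font size and contrast."""
    contrast = sum(abs(a - b) for a, b in zip(g.color, g.background))
    return g.font_size >= MIN_FONT_SIZE and contrast >= MIN_COLOR_DISTANCE


def train_bigrams(corpus: str) -> Counter:
    """Count character bigrams in a reference text (a toy language model)."""
    return Counter(zip(corpus, corpus[1:]))


def score(text: str, bigrams: Counter) -> float:
    """Sum of log(count + 1) over the bigrams of text; higher is more plausible."""
    return sum(math.log(bigrams[pair] + 1) for pair in zip(text, text[1:]))


def covertext(glyphs: List[Glyph], bigrams: Counter) -> str:
    """Drop invisible glyphs, then keep the best-scoring candidate reading order."""
    visible = [g for g in glyphs if is_visible(g)]
    candidates = [
        sorted(visible, key=lambda g: (g.y, g.x)),  # row by row
        sorted(visible, key=lambda g: (g.x, g.y)),  # column by column
    ]
    texts = ["".join(g.char for g in order) for order in candidates]
    return max(texts, key=lambda t: score(t, bigrams))
```

In this sketch, a word laid out column by column in a table and padded
with white-on-white or one-point characters is first stripped of the
hidden characters; the column-wise order is then selected whenever the
bigram model scores it higher than the row-wise order.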
We evaluate the detection and resolution of hidden salting in
a set of e-mails, we generate statistics on hidden salting in two
e-mail corpora, and we assess the effectiveness of the restored
text in an e-mail filtering task.
The contribution of this work consists of technologies for detecting
the presence of salting and for resolving it into what the user of the
content really perceives. Our findings overturn the
traditional definition of text currently in use when processing
digital media, where a text is seen as a sequence of word tokens
and each word consists of a sequence of characters. Because of
its communicative function, a text, in our view, is defined by
what a user perceives, no matter how it is digitally constructed, now
or in the future. The digital textual source gives us additional
information on how the text is constructed and possibly
manipulated. This aspect provides a timeless dimension to our
research and transcends applications such as e-mail filtering.