IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 5, NO. 4, DECEMBER 2010

Identifying and Resolving Hidden Text Salting

Marie-Francine Moens, Jan De Beer, Erik Boiy, and Juan Carlos Gomez

Abstract—Hidden salting in digital media involves the intentional addition or distortion of content patterns with the purpose of evading content filtering. We propose a method to detect portions of a digital text source that are invisible to the end user when they are rendered on a visual medium (such as a computer monitor). The method consists of “tapping” into the rendering process and analyzing the rendering commands to identify portions of the source text (plaintext) that will be invisible to a human reader, using criteria based on text character and background colors, font size, overlapping characters, etc. Moreover, text deemed visible (covertext) is reconstructed from the rendering commands, and the character reading order, which could differ from the rendering order, is then identified. The detection and resolution of hidden salting is evaluated on two e-mail corpora, and the effectiveness of the method in a spam filtering task is assessed. We provide a solution to a relevant open problem in content filtering applications, namely the presence of tricks aimed at circumventing automatic filters.

Index Terms—Content filtering, content manipulation.

I. INTRODUCTION

SALTING is the intentional addition or distortion of content patterns in a digital source in order to evade automated content analysis and filtering. We make a distinction between surface salting (e.g., images containing random, anomalous pixel dots) and hidden salting (e.g., text displayed with invisible ink), depending on whether the salting is visually perceivable by the user of the content (surface salting) or not (hidden salting). In this paper, we only consider hidden salting of textual data, where text is encoded into computer-readable formats (e.g., ASCII, Unicode, HTML).
By text, we mean any sequence or constellation of characters in a particular writing system, meant for human interpretation in a communicative act. Hidden salting is most dangerous in the context of fraudulent schemes and is, for instance, found in phishing e-mails that aim at stealing personal information, which can be used to commit identity theft [1]–[3].

Salting in digital content is a phenomenon that has only recently drawn scientific attention. However, given the increasing usage and importance of automated content filtering, e.g., on interconnected networks including the Internet, it can be expected that salting will become more widespread and sophisticated.

Manuscript received December 23, 2009; revised June 22, 2010; accepted July 09, 2010. Date of publication August 03, 2010; date of current version November 17, 2010. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Ton Kalker.
M.-F. Moens and E. Boiy are with the Department of Computer Science, Katholieke Universiteit Leuven, Heverlee B-3001, Belgium (e-mail: sien.moens@cs.kuleuven.be; erik.boiy@cs.kuleuven.be).
J. De Beer was with Katholieke Universiteit Leuven, Heverlee B-3001, Belgium. He is now with IBM, Brussels B-1130, Belgium.
J. C. Gomez was with Katholieke Universiteit Leuven, Heverlee B-3001, Belgium. He is now with ITESM, Monterrey 64710, Mexico (e-mail: juancarlos.gomez@invitados.itesm.mx).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIFS.2010.2063024

This paper proposes a method for detecting hidden or distorted content in a digital text source that is not perceived by the human end user when the plaintext (i.e., the literal, original, machine-readable version, e.g., an HTML source) of the digital text is rendered through text (re)production processes on output devices (e.g., the user’s monitor screen) (see Fig. 1).
The method generates the covertext of the digital text, which is defined as the truly perceived content of the source. The paper provides evidence that hidden salting can be correctly detected and resolved by looking not only at the source text, but also at what a user perceives. The proposed method consists of tapping into the rendering process of the source text and analyzing the rendering commands to identify portions of the text that are invisible to a human reader, using criteria based on text character and background colors, font size, overlapping characters, etc. Moreover, the text deemed visible (i.e., the covertext) is reconstructed from the visible characters by identifying the character reading order, which could differ from the rendering order stated in the source. This cognitive model of the text’s reading order relies on several language models (i.e., statistical models of text written in a certain language) obtained from a large collection of texts. Reconstructing the reading order of characters often implies segmenting the text into blocks with a uniform reading order. The computation of this segmentation is treated as a greedy search among the possible segmentations, but it can still be computationally expensive for complex messages that contain many tables and frames. Nevertheless, the method exposes the character sequences in the source that have been manipulated. This is valuable because such filtering replaces expensive manual analyses, for instance when inspecting e-mails and building rules for spam filtering. We evaluate the detection and resolution of hidden salting in a set of e-mails, we generate statistics on hidden salting in two e-mail corpora, and we assess the effectiveness of the restored text in an e-mail filtering task.

The contribution of this work concerns technologies for detecting the presence of salting and for resolving it to what the user of the content really perceives.
Our findings overturn the traditional definition of text currently in use when processing digital media, where a text is seen as a sequence of word tokens and each word consists of a sequence of characters. Because of its communicative function, a text, in our view, is defined by what a user perceives, no matter how it is digitally constructed, now or in the future. The digital textual source gives us additional information on how the text is constructed and possibly manipulated. This aspect provides a timeless dimension to our research and transcends applications such as e-mail filtering.
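To make the visibility criteria named above more concrete, the following is a minimal sketch of how two of them (text versus background color, and font size) might be applied to characters captured from a rendering process. It is an illustration under stated assumptions, not the implementation evaluated in this paper: the `Glyph` record, the Euclidean RGB distance, and both threshold values are hypothetical, and neither overlapping characters nor reading-order reconstruction is handled.

```python
from dataclasses import dataclass
from typing import List, Tuple

RGB = Tuple[int, int, int]


@dataclass
class Glyph:
    """One character as captured by a hypothetical rendering tap."""
    char: str
    font_size: float  # in points
    fg: RGB           # text color
    bg: RGB           # background color behind the character


def color_distance(a: RGB, b: RGB) -> float:
    """Euclidean distance in RGB space, a crude stand-in for perceived contrast."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def is_hidden(g: Glyph, min_font_size: float = 3.0, min_distance: float = 30.0) -> bool:
    """Flag a character as invisible; both thresholds are assumed values."""
    if g.font_size < min_font_size:
        return True  # too small to be read
    return color_distance(g.fg, g.bg) < min_distance  # blends into the background


def visible_text(glyphs: List[Glyph]) -> str:
    """Keep only visible characters, in rendering order (not reading order)."""
    return "".join(g.char for g in glyphs if not is_hidden(g))


BLACK, WHITE = (0, 0, 0), (255, 255, 255)
glyphs = [
    Glyph("h", 12.0, BLACK, WHITE),  # ordinary visible character
    Glyph("z", 12.0, WHITE, WHITE),  # white on white
    Glyph("i", 12.0, BLACK, WHITE),  # ordinary visible character
    Glyph("q", 1.0, BLACK, WHITE),   # font too small
]
print(visible_text(glyphs))
```

In this sketch the plaintext is the full sequence "hziq", while the text recovered as visible omits the white-on-white and tiny-font characters. A practical detector would need a perceptually motivated contrast measure and thresholds tuned on data.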