Natural Language Watermarking and Robust Hashing Based on Presuppositional Analysis

Olga Vybornova and Benoit Macq
Communications and Remote Sensing Lab, Universite Catholique de Louvain, Belgium
vybornova@tele.ucl.ac.be, macq@tele.ucl.ac.be

Abstract

We propose a method of text watermarking and hashing based on natural-language semantic structures. In particular, we are interested in the linguistic semantic phenomenon of presupposition. Presupposition is implicit information that is taken for granted by the reader and establishes common ground between the author’s and reader’s situational knowledge; it is a semantic component of certain linguistic expressions (lexical items and syntactic constructions called presupposition triggers). The same sentence can be used with or without presupposition, provided that all the relations between discourse referents are preserved. The number of presuppositions in randomly grouped sentences and the web of resolved presupposed information in the text hold the watermark (e.g. an integrity watermark or proof of ownership), introducing a “secret ordering” into the text structure that makes it resilient to a certain amount of data-altering attacks. This intrinsic structure of the text can also be used as a robust hash of the text.

1. Introduction

Semantic-level text analysis for the purpose of watermarking has attracted attention since 2000, when research based on lexical substitution within synonym sets was first proposed (see [9] for more details). Later, [1] and [2] proposed algorithms to embed information in the tree structure of the text. In these methods the watermark was embedded not directly in the text but in the parsed representation of sentences. Automatic linguistic semantic analysis is still not a well-developed field, but we try to use its best recent advances and propose a new approach to embedding watermarks with the help of semantic representations of single sentences and of the whole text.
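To make the idea of counting presuppositions in randomly grouped sentences more concrete, the following is a minimal Python sketch. It is not the algorithm of this paper. The tiny trigger lexicon `TRIGGERS`, the key-seeded grouping in `group_sentences`, and the parity rule in `extract_bits` are all hypothetical assumptions chosen for illustration, and simple keyword matching stands in for the semantic analysis that real presupposition detection would require.

```python
import random
import re

# Hypothetical, deliberately tiny trigger lexicon (an assumption for this
# sketch; real trigger detection needs semantic analysis, not keywords).
TRIGGERS = {"again", "stopped", "regrets", "realized", "still", "too"}


def count_triggers(sentence):
    """Count the words of a sentence that appear in the trigger lexicon."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(1 for w in words if w in TRIGGERS)


def group_sentences(n_sentences, group_size, key):
    """Partition sentence indices into groups using a key-seeded shuffle.

    The secret key plays the role of the "secret ordering": without it,
    the grouping cannot be reproduced.
    """
    indices = list(range(n_sentences))
    random.Random(key).shuffle(indices)
    return [indices[i:i + group_size]
            for i in range(0, n_sentences, group_size)]


def extract_bits(sentences, group_size, key):
    """Read one bit per group: the parity of its trigger count (assumed rule)."""
    groups = group_sentences(len(sentences), group_size, key)
    return [sum(count_triggers(sentences[i]) for i in g) % 2 for g in groups]


if __name__ == "__main__":
    text = [
        "John stopped smoking.",
        "Mary called again.",
        "The weather was fine.",
        "Peter regrets leaving.",
    ]
    print(extract_bits(text, group_size=2, key=2024))
```

In such a scheme, embedding would mean adding or removing a presupposition in some sentence of a group until the group's parity matches the desired bit, which is only acceptable if the rewritten sentence preserves the relations between discourse referents.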
To be suitable for watermarking purposes, an embedding process should not change the meaning of the text. The text should remain clear and readable so that communication is not disturbed, and it should stay fluent and grammatical, complying with the grammar rules of the language. Preserving the style of the author is also very important in some domains, such as news channels or literary writing [9].

In our approach [11] we distinguish between text and discourse: text is the result of the verbal activity of its producer (a speaker or a writer), whereas discourse is the verbal activity process itself, i.e. the text together with all the pragmatic, psychological, cultural and other factors influencing its generation. This distinction clarifies the later discussion in Section 2.2, which covers mechanisms of dynamic semantics and context update.

The text, i.e. the result of discourse generation, is not an aggregate of separate sentences considered in isolation, and the meaning of the whole text cannot always be derived compositionally. Rather, a text is an integral entity held together by intersentential links provided by certain linguistic means, such as anaphoric links, presuppositions, ellipsis and coreference. Each sentence is a new contribution to the whole discourse: information accumulates with every new step and is consistently integrated into the previous discourse. This property of integrity, together with the underlying semantic relationships within a text, allows us to develop a new robust method of text watermarking based on efficient semantic representations of the text.

2. Prerequisites for text watermarking using semantic representations

2.1. Sentence transformations based on the presupposition triggers

Our watermarking approach is based on the linguistic semantic phenomenon called presupposition.
Presuppositions build the semantic basis of discourse, provide its coherence and consistency, and are important for creating common ground between the author of the text and its readers. Presupposition is defined as a sort of implicit information which is considered well-known or