Word Length Based Zero-Watermarking Algorithm for Tamper Detection in Text Documents

Zunera Jalil, Anwar M. Mirza, and Hajira Jabeen
FAST National University of Computer and Emerging Sciences, A.K. Barohi Road, H-11/4, Islamabad, Pakistan
E-mail: {zunera.jalil, anwar.m.mirza, hajira.jabeen}@nu.edu.pk

Abstract- Copyright protection and authentication of digital content have become a major concern in the current digital era. Plain text is a widely used means of information exchange on the Internet, and it is essential to verify the authenticity of information in any form of communication. Very few techniques are available for plain text watermarking, authentication, and tamper detection. This paper presents a novel zero-watermarking algorithm for tamper detection in plain text documents. The algorithm generates a watermark based on the text contents, which can later be extracted using the extraction algorithm to determine whether the text document has been tampered with. Experimental results demonstrate the effectiveness of the algorithm against random tampering attacks. Watermark pattern matching and watermark distortion rate are used as evaluation parameters on multiple text samples of varying length.

Keywords- watermarking; tamper detection; authentication; security; algorithm

I. INTRODUCTION

Copyright protection and authentication of digital content have become more important with the increasing use of the Internet, e-commerce, and other communication technologies. While these technologies make it easier to access information in a short time span, they have also made it difficult to protect the copyright of digital content and to prove the authenticity of the information obtained. Digital content mostly comprises text, images, audio, and video. Authentication and copyright protection of digital images, audio, and video have received due attention from researchers in the past. However, authentication, tamper detection, and copyright protection of plain text have been largely ignored.
Most digital content, such as websites, e-books, articles, news, chats, and SMS, is in the form of plain text. The threats of illegal copying, tampering, imitation, plagiarism, forgery, and other forms of possible disruption need to be addressed specifically for plain text.

Digital watermarking provides a solution to authenticate and protect digital content. Digital watermarking methods are used to identify the original copyright owner(s) of the content, which can be an image, plain text, audio, video, or a combination of these. A digital watermark is a visible or invisible (preferably the latter) identification code that is permanently embedded in the data. This means that, unlike conventional cryptographic techniques, it remains present within the data even after the decryption process [1].

Text, although the easiest mode of communication and information exchange, brings many challenges when it comes to copyright protection and authentication. All changes to the text must preserve its value, utility, meaning, and grammaticality. Short documents are harder to protect and authenticate, since a simple analysis would easily reveal the watermark. In image, audio, and video watermarking, the limitations of the Human Visual and/or Human Auditory System and inherent redundancies are exploited for watermark embedding. It is difficult to find such limitations and redundancy in plain text, since text is sensitive to any modification required for watermark embedding. Text is also easier to copy, reproduce, and tamper with than images, audio, and video. Text, being a specialized medium, requires specialized copyright protection and authentication solutions.

Traditional watermarking algorithms modify the contents of the digital medium to be protected by embedding a watermark. This traditional approach is not practical for plain text. A specialized approach such as zero-watermarking is better suited to plain text.
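To make the general idea of zero-watermarking concrete, the following is a minimal illustrative sketch and not the algorithm proposed in this paper. It assumes a hypothetical scheme in which the watermark is a SHA-256 digest of the sequence of word lengths; the function names `derive_watermark` and `is_tampered` are introduced here for illustration only. The point it shows is that the text itself is never modified: the watermark is computed from the text, kept separately, and recomputed later for comparison.

```python
import hashlib
import re


def derive_watermark(text: str) -> str:
    """Derive a watermark from the text without changing the text.

    Hypothetical scheme for illustration: hash the sequence of word lengths.
    """
    words = re.findall(r"[A-Za-z]+", text)
    lengths = ",".join(str(len(word)) for word in words)
    return hashlib.sha256(lengths.encode("utf-8")).hexdigest()


def is_tampered(text: str, registered_watermark: str) -> bool:
    """Recompute the watermark and compare it with the stored one."""
    return derive_watermark(text) != registered_watermark


original = "Plain text is widely used on the Internet"
watermark = derive_watermark(original)  # stored separately from the text

# An unmodified copy matches; an insertion changes the word-length sequence.
print(is_tampered(original, watermark))
print(is_tampered("Plain text is not widely used on the Internet", watermark))
```

This toy scheme would miss a substitution that preserves every word length, which is one reason a practical algorithm needs a more carefully designed watermark pattern.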
In this paper, we propose a novel zero-watermarking algorithm which utilizes the contents of the text itself for its authentication. A zero-watermarking algorithm does not change the characters of the original data, but utilizes the characters of the original data to construct the original watermark information [2-3].

The paper is organized as follows: Section 2 provides an overview of the previous work done on text watermarking. The proposed embedding and extraction algorithms are described in detail in Section 3. Section 4 presents the experimental results for the tampering (insertion, deletion, and re-ordering) attacks. Performance of the proposed algorithm is evaluated on multiple text samples. The last section concludes the paper along with directions for future work.

II. PREVIOUS WORK

Text watermarking for authentication of text documents is an important area of research; however, the work done in this domain in the past is inadequate. Work on text watermarking started in 1991, and a number of text watermarking techniques have been proposed since then. These include text watermarking using text images, as well as synonym based, pre-supposition based, syntactic tree based, noun-verb based, word or sentence based, acronym based, and typo error based methods, among many others.

V6-378 978-1-4244-6349-7/10/$26.00 © 2010 IEEE