Source Printer Identification from Document Images Acquired using Smartphone

Sharad Joshi, Suraj Saxena, Nitin Khanna, Member, IEEE

Abstract—Vast volumes of printed documents continue to be used for various important as well as trivial applications. Such applications often rely on the information provided in the form of printed text documents, whose integrity verification poses a challenge due to time constraints and lack of resources. Source printer identification provides essential information about the origin and integrity of a printed document in a fast and cost-effective manner. Even when fraudulent documents are identified, information about their origin can help stop future frauds. If a smartphone camera replaces the scanner for the document acquisition process, document forensics would be more economical, user-friendly, and even faster in many applications where remote and distributed analysis is beneficial. Building on existing methods, we propose to learn a single CNN model from the fusion of letter images and their printer-specific noise residuals. In the absence of any publicly available dataset, we created a new dataset consisting of 2250 document images of text documents printed by eighteen printers and acquired by a smartphone camera at five acquisition settings. The proposed method achieves 98.42% document classification accuracy using images of letter ‘e’ under a 5×2 cross-validation approach. Further, when tested using about half a million letters of all types, it achieves 90.33% and 98.01% letter and document classification accuracies, respectively, thus highlighting the ability to learn a discriminative model without dependence on a single letter type. Also, classification accuracies are encouraging under various acquisition settings, including low illumination and change in angle between the document and camera planes.

Index Terms—Printer Classification, Convolutional Neural Network, Forgery Detection, Printer Dataset, Smartphone.

I. INTRODUCTION

Usage of digital documents has increased sharply in the last decade. However, security issues, cost of transition, and acceptability by the workforce restrict a complete transition from printed to digital documents. Such restrictions have encouraged continued usage of printed documents in many financial and administrative dealings such as agreements, deeds, business communication, and record-keeping. As a result, digital and printed documents co-exist. According to the 2018 global forest product facts and figures report of the Food and Agriculture Organization of the United Nations, production of printing and writing paper was 96 million tonnes in 2018 [1] and has been steady since 2014. Such a large volume of printed documents requires fast and accurate digital systems to predict their origin and integrity. Knowing the source printer gives an investigator useful information about the origin and integrity of a printed document.

Sharad Joshi, Suraj Saxena, and Nitin Khanna are with the Multimedia Analysis and Security (MANAS) Lab, Electrical Engineering, Indian Institute of Technology Gandhinagar (IITGN), Gujarat, 382355 India. E-mail: {nitinkhanna}@iitgn.ac.in

Fig. 1: Overview of problem scenario (printers, printed documents, smartphone camera, digital documents, documents linked to source printers; real-world scenario: source printer unknown)

The problem of attributing the source printer to a printed document has been studied extensively in the literature using digital methods [2], [3]. The two main approaches to source printer identification are (1) extrinsic methods based on embedding a signal in the printed document [4] and (2) intrinsic methods that exploit artifacts introduced by the combination of various electro-mechanical parts of a printer [5].
Apart from being costly and complex, extrinsic solutions require access to the printer before the document is printed, which may not be feasible as manufacturers are not legally bound to integrate such solutions into their printers. On the other hand, intrinsic solutions only require sample document(s) printed from the printer and can be used to investigate documents printed in the past.

All the existing methods in the literature make use of a reference scanner to acquire a digital image of the printed document. Meanwhile, smartphones with an in-built camera have become very common. Compared to scanners, smartphones are compact, easy to use, and can be quickly deployed to acquire and transmit document images in a remote working environment. Most importantly, they are lightweight gadgets that people have become accustomed to carrying almost all the time. The document analysis community has recently started working towards replacing batch scanners with smartphone cameras [6]. In document forensics, a very recent approach proposed a method for source identification of color images printed by color laser printers [7]. There are significant differences between the working of color and ‘black-and-white’ (grayscale) printers, so the method proposed in [7] cannot be used on grayscale documents.

arXiv:2003.12602v1 [cs.CV] 27 Mar 2020

In this work, we explore the scenario whereby a smartphone