Cross Domain Assessment of Document to HTML Conversion Tools to Quantify Text and Structural Loss During Document Analysis

Kyle Goslin
Department of Informatics and Engineering, Institute of Technology Blanchardstown, Blanchardstown Road North, Dublin 15, Ireland
kylegoslin@gmail.com

Markus Hofmann
Department of Informatics and Engineering, Institute of Technology Blanchardstown, Blanchardstown Road North, Dublin 15, Ireland
markus.hofmann@itb.ie

Abstract—During forensic text analysis, automation of the process is key when working with large quantities of documents. As documents come in a wide variety of file types, tailored tools must be developed to analyze each document type and to correctly identify and extract text elements for analysis without loss. These text extraction tools often omit sections of text that are unreadable, leaving drastic inconsistencies in the forensic text analysis process. As a solution, a single output format, HTML, was chosen as a unified analysis format. Document to HTML/CSS extraction tools, each with varying techniques for converting common document formats to rich HTML/CSS counterparts, were tested. This approach can reduce the number of analysis tools needed during forensic text analysis by utilizing a single document format. Two tests were designed, a 10 point document overview test and a 48 point detailed document analysis test, to assess and quantify the level of loss, rate of error, and overall quality of the outputted HTML structures. This study concluded that tools that utilize a number of different approaches and have an understanding of the document structure yield the best results with the least amount of loss.

I. INTRODUCTION

In a number of different sectors, large repositories of documents are often built up, each with a wide variety of formatting techniques and styles from a number of different authors.
When a forensic text analysis of the documents in these repositories is needed, manual analysis is no longer feasible. Although many different file types are used, the vast majority of documents often fall into a small subset of these types: Microsoft .doc, .docx, .ppt, .pptx and the now open standard .pdf. Each document format uses a different internal representation, such as plain text, XML or binary, with different approaches used during the document rendering process. This variety of file types presents a fundamental problem during forensic text analysis, as a separate tool would need to be created to deal with each file format.

As a solution to the multi-filetype problem, a single style- and representation-based format such as HTML can be used as a bridging format into which all documents can be converted. As HTML has existed for a number of years, a multitude of tools are currently available for converting common file formats into HTML. When utilized correctly, HTML can be applied to create identical representations of the original documents, which can be used in their place to provide better searchability and more flexibility when analyzing documents. The quantity of files in a repository can also cause an issue, as manually converting files is not a feasible option due to the time required for each file. For this reason, an automated conversion tool is needed.

This paper outlines the background issues that arise during the conversion of documents to HTML for forensic text analysis in Section II. Section III outlines the document types and content variations that were dealt with during this study. Section IV outlines the various tools and approaches that are currently available for converting documents to HTML/CSS counterparts.
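The automated conversion described above can be sketched as a script that walks a repository and routes each supported file type to a converter. The specific converters used here, LibreOffice in headless mode and Poppler's pdftohtml, are illustrative assumptions, not the tools evaluated in this study:

```python
import subprocess
from pathlib import Path

# File types covered in this study, each routed to a converter invocation.
# The choice of LibreOffice ("soffice") and Poppler's "pdftohtml" is an
# assumption for illustration; the study compares several such tools.
SUPPORTED = {".doc", ".docx", ".ppt", ".pptx", ".pdf"}

def converter_command(path: Path, out_dir: Path) -> list:
    """Return the command that converts `path` to an HTML counterpart."""
    if path.suffix.lower() == ".pdf":
        # pdftohtml takes the input file and an output name prefix
        return ["pdftohtml", str(path), str(out_dir / path.stem)]
    # LibreOffice handles the Microsoft Office formats in headless mode
    return ["soffice", "--headless", "--convert-to", "html",
            "--outdir", str(out_dir), str(path)]

def convert_repository(repo: Path, out_dir: Path) -> None:
    """Walk a document repository and convert every supported file."""
    for path in repo.rglob("*"):
        if path.suffix.lower() in SUPPORTED:
            subprocess.run(converter_command(path, out_dir), check=True)
```

Routing by extension is the simplest dispatch strategy; a production tool would also need to handle converter failures and name collisions in the output directory.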
Section V outlines Experiment 1 and Experiment 2, which were used to gauge the level of loss and the quality of the outputs for the selected tools. Finally, Section VI reflects on the overall findings of this study.

II. BACKGROUND

Converting PDF files to HTML can be done with a number of different tools, each of which implements a different approach [1][2]. HTML documents are often generated to create web based representations of previously unindexable documents, as HTML is fully text based. This need to extract text from documents has led to a number of assessments of the PDF to HTML conversion process in this area [3]. The utilization of document layout and styling information has been underway for a number of years. Segmentation based upon HTML structure [4] has been used in the process of information extraction. The utilization of DOM trees and bounding boxes to aid additional processing has been used for a number of different purposes, such as aiding search and text matching [5], [6], [7].

2013 European Intelligence and Security Informatics Conference 978-0-7695-5062-6/13 $26.00 © 2013 IEEE DOI 10.1109/EISIC.2013.22
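The benefit of a unified HTML bridging format is that a single DOM-aware parser can then extract text from every converted document, regardless of its original file type. A minimal sketch using Python's standard html.parser (the sample markup is illustrative, not taken from the study's corpus):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, ignoring markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # handle_data receives the character data between tags
        text = data.strip()
        if text:
            self.chunks.append(text)

    def text(self):
        return " ".join(self.chunks)

extractor = TextExtractor()
extractor.feed("<html><body><h1>Report</h1><p>Converted content.</p></body></html>")
print(extractor.text())  # Report Converted content.
```

A real forensic pipeline would additionally walk the DOM to preserve structural information such as headings and lists, which is exactly the kind of detail whose loss the two experiments quantify.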