Cross Domain Assessment of Document to HTML
Conversion Tools to Quantify Text and Structural
Loss During Document Analysis
Kyle Goslin
Department of Informatics and Engineering,
Institute of Technology Blanchardstown,
Blanchardstown Road North,
Dublin 15, Ireland
kylegoslin@gmail.com
Markus Hofmann
Department of Informatics and Engineering,
Institute of Technology Blanchardstown,
Blanchardstown Road North,
Dublin 15, Ireland
markus.hofmann@itb.ie
Abstract—Automation is key during forensic text analysis when working with large quantities of documents. Because documents come in a wide variety of file types, tailored tools must be developed to analyze each document type and to correctly identify and extract text elements for analysis without loss.
These text extraction tools often omit sections of text that they cannot read, leaving drastic inconsistencies during the forensic text analysis process. As a solution, a single output format, HTML, was chosen as a unified analysis format. Document-to-HTML/CSS extraction tools, each applying different techniques to convert common document formats into rich HTML/CSS counterparts, were tested. This approach can reduce the number of analysis tools needed during forensic text analysis by relying on a single document format.
Two tests were designed: a 10-point document overview test and a 48-point detailed document analysis test, used to assess and quantify the level of loss, the rate of error, and the overall quality of the output HTML structures.
This study concluded that tools that combine a number of different approaches and have an understanding of the document structure yield the best results with the least loss.
I. INTRODUCTION
In a number of different sectors, large repositories of documents are built up over time, each containing a wide variety of formatting techniques and styles from many different authors. When a forensic text analysis of the documents in these repositories is needed, manual analysis is no longer feasible.
Although many different file types are in use, the vast majority of documents fall into a small subset of them: Microsoft .doc, .docx, .ppt and .pptx, and the now open-standard .pdf. Each format uses a different internal representation, such as plain text, XML or binary, and a different approach during the document rendering process.
This variety of file types poses a fundamental problem during forensic text analysis, as a separate tool would need to be created to deal with each file format.
As a solution to the multi-filetype problem, a single style- and representation-based format such as HTML can be used as a bridging format into which all documents are converted. As HTML has existed for many years, a multitude of tools is available for converting common file formats into HTML. When utilized correctly, HTML can produce identical representations of the original documents and can be used in their place, providing better searchability and more flexibility when analyzing documents. The quantity of files in a repository is also an issue, as converting files manually is not feasible given the time required per file. For this reason, an automated conversion tool is needed.
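Such an automated pipeline can be sketched as a simple dispatch table keyed on file extension, which selects a converter command per document type. This is a minimal sketch: the converter names below are placeholders for illustration only, not the tools assessed later in this paper.

```python
from pathlib import Path

# Placeholder mapping from file extension to an external converter
# command; these tool names are hypothetical examples, not the
# converters evaluated in this study.
CONVERTERS = {
    ".pdf":  ["pdftohtml", "-s"],
    ".doc":  ["doc2html"],
    ".docx": ["docx2html"],
}

def to_html_command(path):
    """Build the converter command for one input file, or return
    None if the file's extension is not recognized."""
    src = Path(path)
    cmd = CONVERTERS.get(src.suffix.lower())
    if cmd is None:
        return None
    # Output lands next to the source file, with an .html suffix.
    return cmd + [str(src), str(src.with_suffix(".html"))]
```

In practice the returned command would be handed to a process runner and executed over every file in the repository, which is what makes the approach feasible at scale.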
This paper outlines the background issues that arise during
the conversion process of documents to HTML for forensic text
analysis in Section II. Section III outlines the document types
and content variations that were dealt with during this study.
Section IV outlines the various tools and approaches that are
currently available for converting documents to HTML/CSS
counterparts.
Section V outlines Experiment 1 and Experiment 2 that
were used to gauge the level of loss and quality of outputs for
the selected tools. Finally, Section VI reflects on the overall findings of this study.
II. BACKGROUND
Converting PDF files to HTML can be done with a number of different tools, each of which implements a different approach [1], [2]. HTML documents are often generated to create web-based representations of previously unindexable documents, as HTML is fully text based. This need to extract text from documents has led to a number of assessments of the PDF-to-HTML conversion process in this area [3].
Document layout and styling information has been exploited for a number of years. Segmentation based on HTML structure [4] has been used in the process of information extraction. DOM trees and bounding boxes have been used to aid further processing for a number of purposes, such as search and text matching [5], [6], [7].
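A minimal form of such DOM-based segmentation can be sketched with Python's standard `html.parser` module, collecting the text of each block-level element. This is a simplified illustration (it does not handle nested block elements), not the segmentation method of the cited works.

```python
from html.parser import HTMLParser

# Tags treated as block-level segment boundaries in this sketch.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "li", "td"}

class BlockSegmenter(HTMLParser):
    """Collect the text content of each block-level element,
    a minimal form of DOM-based segmentation."""
    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._buf = []          # start a new segment

    def handle_data(self, data):
        self._buf.append(data)      # accumulate text inside the segment

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            text = "".join(self._buf).strip()
            if text:
                self.blocks.append(text)
            self._buf = []

seg = BlockSegmenter()
seg.feed("<h1>Title</h1><p>First paragraph.</p><p>Second.</p>")
```

After feeding the fragment above, `seg.blocks` holds one entry per block element, which downstream steps such as search or text matching could then operate on.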
2013 European Intelligence and Security Informatics Conference
978-0-7695-5062-6/13 $26.00 © 2013 IEEE
DOI 10.1109/EISIC.2013.22
100