SOFTWARE REVIEW Open Access
Layout-aware text extraction from full-text PDF of scientific articles
Cartic Ramakrishnan¹*, Abhishek Patnia², Eduard Hovy¹ and Gully APC Burns¹
Abstract
Background: The Portable Document Format (PDF) is the most commonly used file format for online scientific
publications. The absence of effective means to extract text from these PDF files in a layout-aware manner presents
a significant challenge for developers of biomedical text mining or biocuration informatics systems that use
published literature as an information source. In this paper we introduce the ‘Layout-Aware PDF Text Extraction’
(LA-PDFText) system to facilitate accurate extraction of text from PDF files of research articles for use in text mining
applications.
Results: Our paper describes the construction and performance of an open source system that extracts text blocks
from PDF-formatted full-text research articles and classifies them into logical units based on rules that characterize
specific sections. The LA-PDFText system focuses only on the textual content of the research articles and is meant
as a baseline for further experiments into more advanced extraction methods that handle multi-modal content,
such as images and graphs. The system works in a three-stage process: (1) detecting contiguous text blocks using
spatial layout processing, (2) classifying text blocks into rhetorical categories using a rule-based method and
(3) stitching classified text blocks together in the correct order, resulting in the extraction of text from
section-wise grouped blocks. We show that our system can identify text
blocks and classify them into rhetorical categories with Precision¹ = 0.96, Recall = 0.89 and F1 = 0.91. We also
present an evaluation of the accuracy of the block detection algorithm used in step 1. Additionally, we have
compared the accuracy of the text extracted by LA-PDFText to the text from the Open Access subset of PubMed
Central. We then compared this accuracy with that of the text extracted by the PDF2Text system,² commonly used
to extract text from PDF. Finally, we discuss preliminary error analysis for our system and identify further areas of
improvement.
Conclusions: LA-PDFText is an open-source tool for accurately extracting text from full-text scientific articles. The
release of the system is available at http://code.google.com/p/lapdftext/.
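The three-stage process summarized in the abstract can be illustrated with a minimal sketch. This is not the LA-PDFText implementation: the names `Word`, `Block`, `detect_blocks`, `classify_blocks`, `stitch` and `extract_sections`, the vertical-gap threshold, and the single heading rule are all assumptions made for illustration, and the sketch assumes a single-column page whose words already carry coordinates.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Word:
    text: str
    x: float       # left edge of the word
    y: float       # top edge of the word (grows down the page)
    height: float  # glyph height, used to scale the gap threshold


@dataclass
class Block:
    words: List[Word]
    label: str = "unclassified"

    @property
    def text(self) -> str:
        return " ".join(w.text for w in self.words)


def detect_blocks(words: List[Word], max_gap: float = 1.5) -> List[Block]:
    """Stage 1 (toy version): start a new block whenever the vertical
    distance to the previous word exceeds max_gap * its height."""
    blocks: List[Block] = []
    for w in sorted(words, key=lambda w: (w.y, w.x)):
        if blocks:
            prev = blocks[-1].words[-1]
            if w.y - prev.y <= max_gap * prev.height:
                blocks[-1].words.append(w)
                continue
        blocks.append(Block([w]))
    return blocks


# Hypothetical heading vocabulary; a real rule set would be far richer.
SECTION_HEADINGS = {"abstract", "background", "methods", "results", "conclusions"}


def classify_blocks(blocks: List[Block]) -> List[Block]:
    """Stage 2 (toy rule): a short block starting with a known heading word
    is a heading; every following block inherits that section label."""
    current = "front-matter"
    for b in blocks:
        first = b.words[0].text.lower().rstrip(":")
        if len(b.words) <= 3 and first in SECTION_HEADINGS:
            current = first
            b.label = "heading:" + first
        else:
            b.label = current
    return blocks


def stitch(blocks: List[Block]) -> Dict[str, str]:
    """Stage 3: join body blocks that share a label, in reading order."""
    sections: Dict[str, List[str]] = {}
    for b in blocks:
        if not b.label.startswith("heading:"):
            sections.setdefault(b.label, []).append(b.text)
    return {name: " ".join(parts) for name, parts in sections.items()}


def extract_sections(words: List[Word]) -> Dict[str, str]:
    """Run the three stages in order."""
    return stitch(classify_blocks(detect_blocks(words)))
```

Calling `extract_sections` on a list of positioned words returns a dictionary that maps each section label to its stitched text. Multi-column layouts, font-based cues and page furniture such as running headers are deliberately left out of this sketch.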
Background and motivation
The field of Biomedical Natural Language Processing (BioNLP) is maturing, with specific areas of software
development emerging in response to user requirements: e.g., links between databases and literature, better tool
interactivity and integration, and the development of high-quality NLP resources [1,2]. NLP techniques such as
Named Entity Recognition [3] and Semantic Relation Extraction [4] have been shown to be very useful to biologists
studying protein-protein interactions [5] and Gene-Disease-Phenotype relations [6]. Given the ubiquity of the
‘Portable Document Format’ (PDF) as a means of distributing scientific publications, and since access to
information in full-text documents is vital for developing effective text-mining applications [7], it is essential
to the general BioNLP community that developers of such applications can extract the textual content from PDF
files accurately with open-source tools. Many past biomedical text mining studies have used either the abstracts
of scientific papers [8-11] or relatively small collections of full-text articles sampled from the Open Access
subset of PubMed Central [12]. It is likely that some content from the journals of interest for a particular task
is not distributed as part of the Open Access subset.
* Correspondence: cartic@isi.edu
¹Information Sciences Institute, University of Southern California, 4676
Admiralty Way, Suite 1001, Marina del Rey, CA 90292-6695, USA
Full list of author information is available at the end of the article
© 2012 Ramakrishnan et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the
Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use,
distribution, and reproduction in any medium, provided the original work is properly cited.
Ramakrishnan et al. Source Code for Biology and Medicine 2012, 7:7
http://www.scfbm.org/content/7/1/7