SOFTWARE REVIEW Open Access

Layout-aware text extraction from full-text PDF of scientific articles

Cartic Ramakrishnan¹*, Abhishek Patnia², Eduard Hovy¹ and Gully APC Burns¹

Abstract

Background: The Portable Document Format (PDF) is the most commonly used file format for online scientific publications. The absence of effective means to extract text from these PDF files in a layout-aware manner presents a significant challenge for developers of biomedical text mining or biocuration informatics systems that use published literature as an information source. In this paper we introduce the ‘Layout-Aware PDF Text Extraction’ (LA-PDFText) system to facilitate accurate extraction of text from PDF files of research articles for use in text mining applications.

Results: Our paper describes the construction and performance of an open source system that extracts text blocks from PDF-formatted full-text research articles and classifies them into logical units based on rules that characterize specific sections. The LA-PDFText system focuses only on the textual content of the research articles and is meant as a baseline for further experiments into more advanced extraction methods that handle multi-modal content, such as images and graphs. The system works in a three-stage process: (1) detecting contiguous text blocks using spatial layout processing, (2) classifying text blocks into rhetorical categories using a rule-based method and (3) stitching classified text blocks together in the correct order, resulting in the extraction of text from section-wise grouped blocks. We show that our system can identify text blocks and classify them into rhetorical categories with Precision¹ = 0.96, Recall = 0.89 and F1 = 0.91. We also present an evaluation of the accuracy of the block detection algorithm used in step 1.
Additionally, we compared the accuracy of the text extracted by LA-PDFText with the text from the Open Access subset of PubMed Central. We then compared this accuracy with that of the text extracted by the PDF2Text system,² which is commonly used to extract text from PDF. Finally, we discuss a preliminary error analysis for our system and identify further areas for improvement.

Conclusions: LA-PDFText is an open-source tool for accurately extracting text from full-text scientific articles. The release of the system is available at http://code.google.com/p/lapdftext/.

Background and motivation

The field of Biomedical Natural Language Processing (BioNLP) is maturing, with specific areas of software development emerging in response to user requirements: e.g., links between databases and literature, better tool interactivity and integration, and the development of high-quality NLP resources [1,2]. NLP techniques such as Named Entity Recognition [3] and Semantic Relation Extraction [4] have been shown to be very useful to biologists studying protein-protein interactions [5] and Gene-Disease-Phenotype relations [6]. Given the ubiquity of the ‘Portable Document Format’ (PDF) as a means of distributing scientific publications, and since access to information in full-text documents is vital for developing effective text-mining applications [7], it is essential to the general BioNLP community that developers of such applications can extract the textual content from PDF files accurately with open-source tools.

Many past biomedical text mining studies have used either the abstracts of scientific papers [8-11] or relatively small collections of full-text articles sampled from the Open Access subset of PubMed Central [12]. It is likely that some journal content of interest for a particular task is not distributed as part of the Open Access subset.
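
To make the three-stage process summarized in the abstract more concrete, the following is a minimal illustrative sketch in Python. It is not the LA-PDFText implementation: the `Word` and `Block` types, the vertical-gap threshold, the heading list and the font-size rule are all hypothetical simplifications introduced here, and the sketch assumes a single-column page whose word positions have already been obtained from a PDF parser.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Word:
    """A word with its position on the page (hypothetical input format)."""
    text: str
    x: float          # left edge
    y: float          # top edge; y grows down the page
    font_size: float


@dataclass
class Block:
    """A contiguous group of words plus the label assigned in stage 2."""
    words: List[Word]
    label: str = "unclassified"

    @property
    def text(self) -> str:
        return " ".join(w.text for w in self.words)

    @property
    def top(self) -> float:
        return min(w.y for w in self.words)


# Assumed heading vocabulary; a real rule set would be far richer.
SECTION_HEADINGS = {"abstract", "background", "methods", "results", "conclusions"}


def detect_blocks(words: List[Word], max_gap: float = 15.0) -> List[Block]:
    """Stage 1: start a new block whenever the vertical gap exceeds max_gap."""
    blocks: List[Block] = []
    for w in sorted(words, key=lambda w: (w.y, w.x)):
        if blocks and w.y - blocks[-1].words[-1].y <= max_gap:
            blocks[-1].words.append(w)
        else:
            blocks.append(Block([w]))
    return blocks


def classify_blocks(blocks: List[Block], body_font_size: float = 10.0) -> List[Block]:
    """Stage 2: label blocks with simple rules (known heading text in a larger font)."""
    current = "front-matter"
    for b in blocks:
        text = b.text.strip().lower()
        if text in SECTION_HEADINGS and b.words[0].font_size > body_font_size:
            current = text
            b.label = "heading:" + text
        else:
            b.label = "body:" + current
    return blocks


def stitch(blocks: List[Block]) -> Dict[str, str]:
    """Stage 3: join body blocks in top-to-bottom order, grouped by section."""
    sections: Dict[str, List[str]] = {}
    for b in sorted(blocks, key=lambda b: b.top):
        if b.label.startswith("body:"):
            sections.setdefault(b.label[len("body:"):], []).append(b.text)
    return {name: " ".join(parts) for name, parts in sections.items()}


def extract_sections(words: List[Word]) -> Dict[str, str]:
    """Run the three stages in order."""
    return stitch(classify_blocks(detect_blocks(words)))


if __name__ == "__main__":
    page = [
        Word("Background", 50, 100, 12),
        Word("PDF", 50, 130, 10), Word("matters.", 80, 130, 10),
        Word("Results", 50, 200, 12),
        Word("Blocks", 50, 230, 10), Word("found.", 95, 230, 10),
    ]
    print(extract_sections(page))
```

The sketch only shows how the output of each stage feeds the next. Multi-column layouts, figures and running headers would need considerably more careful spatial processing than a single vertical-gap threshold.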
* Correspondence: cartic@isi.edu
¹ Information Sciences Institute, University of Southern California, 4676 Admiralty Way, Suite 1001, Marina del Rey, CA 90292-6695, USA
Full list of author information is available at the end of the article

© 2012 Ramakrishnan et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Ramakrishnan et al. Source Code for Biology and Medicine 2012, 7:7
http://www.scfbm.org/content/7/1/7