Extracting structure from HTML documents for language visualization and analysis

Robert P. Futrelle, Andrea Elaina Grimes and Mingyan Shao
Biological Knowledge Laboratory
College of Computer and Information Science
Northeastern University
Boston, MA 02115
{futrelle,agrimes,myshao}@ccs.neu.edu

Abstract

Document analysis is shifting from document image analysis to the analysis of electronic documents, especially those available on the Web in HTML and PDF formats. We are analyzing a 250M word collection of HTML-formatted papers from the American Society for Microbiology with the ultimate goal of doing query answering and information extraction. Each document is converted to a sequence of token-id items by an invertible process called Extreme Tokenization. A lexicon is constructed with attributes including token string, tag, capitalization, etc. A number of structures are identified, including section titles, figure captions, document navigation tables and, most importantly, running text blocks. An XML descriptive structure is built using JAXB 1.0. Sentence boundaries are discovered. Language framework patterns are visualized in a custom Framework Viewer to identify important patterns of expression for further analysis. This work complements our diagram analysis research (ICDAR03).

1. Introduction

Our group focuses on the biology literature, a large and rich source of information for biology research and medicine [1]. The goal is to discover the content of text and diagrams in research papers to support information extraction and concept-based queries. This paper focuses on discovering and extracting running text segments in HTML documents, followed by language pattern visualization as a step towards computational linguistic analysis. The methods use highly efficient integer-based (UID) representations for lexical items and text streams, which support the use of relational databases for scalability to multi-billion word textbanks.
We are developing a 250M word textbank drawn from the web-based Highwire Press collection. The HTML structures in the collection are consistent and not overly complex.

There have been decades of important work in the analysis of document images [2]. Current document image systems can discover and label components and extract the reading order [3, 4]. Large numbers of documents are now available in electronic format, so image analysis can be bypassed. These electronic documents, and HTML in particular, still present challenges for structural analysis and content extraction.

For information extraction from HTML, wrappers are developed that describe the structures containing the information of interest, e.g., headings or specific table elements. Manually designing wrappers is infeasible for large heterogeneous collections, so wrapper induction procedures have been developed [5]. Entropy measures [6] and "visually" based methods [7] have been devised for identifying content blocks. XML is a far more expressive means of representing document content, so systems have been developed to convert HTML documents to XML, adding semantics at the same time [8]. The XDOC workbench has been developed for such XML manipulations.

Lessons learned from a biology text mining competition, the KDD Challenge Cup 2002, are reviewed in [9]. An example of a specific task is extracting synonymous gene and protein terms [10]. Still another task is understanding captions in biomedical publications [11], closely related to our own work on diagrams [12].

In analyzing any HTML document, one of the first issues that must be faced is tokenization of the text stream. Some of the standard approaches to tokenization for natural language, e.g., dealing with hyphenation and other punctuation, are discussed in [13].
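A wrapper of the kind described above amounts to a small set of rules keyed to HTML structure, e.g., "text inside heading tags is a section title." The sketch below illustrates the idea with Python's standard `html.parser`; the class name, the choice of tags, and the sample markup are our own assumptions for illustration, not the wrapper systems cited in the text.

```python
from html.parser import HTMLParser

class HeadingWrapper(HTMLParser):
    """Collect text inside h1-h3 tags as candidate section titles
    (a toy wrapper rule; real systems induce such rules automatically)."""
    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.titles = []
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ('h1', 'h2', 'h3'):
            self.in_heading = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag in ('h1', 'h2', 'h3') and self.in_heading:
            self.titles.append(''.join(self._buf).strip())
            self.in_heading = False

    def handle_data(self, data):
        # Only accumulate text while inside a heading element.
        if self.in_heading:
            self._buf.append(data)

finder = HeadingWrapper()
finder.feed('<h2>Materials and Methods</h2><p>Cells were grown in LB medium.</p>')
print(finder.titles)  # ['Materials and Methods']
```

Hand-writing such rules works for one consistent collection, which is why wrapper induction becomes necessary as soon as the collection is heterogeneous.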
One of the first papers that dealt with tokenization and parsing of marked-up text was produced by our group [14]; our current approach to these problems uses a method we call Extreme Tokenization, discussed below.

The 250M word textbank we use is licensed from the American Society for Microbiology (ASM) and is part of the Highwire Press collection. There are about 50G words in the 12M papers in the Highwire collection, http://highwire.stanford.edu/. They are in essentially the same format as our ASM papers, so our approach should apply to all of them with little modification. These textbanks are static, read-only collections. Once published, papers are never altered. This allows us to build large and efficient data structures to represent any
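One consequence of the read-only property is that a document's UID stream can be frozen into a fixed-width integer array rather than a mutable structure. The snippet below is a minimal sketch of that idea using Python's standard `array` module; the sample UIDs are invented, and the paper itself uses relational database tables for scalable storage.

```python
from array import array

# Because papers are never altered after publication, the token stream
# can be stored once as a contiguous block of fixed-width integers.
uids = [17, 3, 42, 3, 17]      # token UIDs from some document (invented)
stream = array('I', uids)      # one unsigned int per token, contiguous

# Random access and exact reconstruction of the UID sequence are cheap.
assert stream[2] == 42
assert list(stream) == uids
```

A frozen integer stream like this is also exactly the shape a relational table or an inverted index wants: (position, UID) pairs with no update machinery needed.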