Information Extraction from Nanotoxicity Related Publications

Lemin Xiao, Kaizhi Tang, Xiong Liu, Hui Yang, Zheng Chen, Roger Xu
Intelligent Automation, Inc., Rockville, MD, USA
E-mail: {lxiao@i-a-i.com, ktang@i-a-i.com, xliu@i-a-i.com, hyang@i-a-i.com, chenfuqing@gmail.com, hgxu@i-a-i.com}

Abstract: High-quality experimental data are important when developing predictive models for studying nanomaterial environmental impact (NEI). Given that raw data from experimental laboratories and manufacturing workplaces are usually proprietary and small in scale, extracting information from publications is an attractive alternative for collecting data. We developed an information extraction system that extracts useful information from full-text nanotoxicity related publications. The system consists of five components: transformation of raw data into a machine-readable format, data preprocessing, ontology-based named entity recognition, rule-based numerical attribute extraction from both tables and unstructured text, and relation extraction among entities and attributes. The system was applied to a dataset of 94 publications and achieved acceptable accuracy. By storing the extracted data in a table according to the relations among the data, we obtain a dataset that can be used to predict nanomaterial environmental impact. Such a system is unique in the current nanomaterial community and can help nanomaterial scientists and practitioners quickly locate the information they need without spending a lot of time reading articles.

Keywords: Nanoinformatics; information extraction; named entity recognition; relation extraction; nanotoxicity; data mining

I. INTRODUCTION

Nanotoxicology, a branch of bionanoscience, aims to determine whether and to what extent the properties of nanoparticles may pose threats to the environment and to human beings [1].
NEIMiner [2] is a model-driven data mining system developed for studying nanomaterial environmental impact (NEI). It aims to build high-quality prediction models to assess the environmental toxicity of engineered nanomaterials based on scientific information, and to help industry and policymakers make risk management decisions [2][3]. High-quality datasets are one of the key factors in discovering high-quality prediction models. Since raw data from experimental laboratories and manufacturing workplaces are usually proprietary and small in scale, extracting information from published articles is a good option for collecting data. Researchers and scientists usually conduct this task manually, which consumes considerable time and resources. With text mining and natural language processing techniques, it is promising to develop an information extraction system that can automatically scan the available nanotoxicity publications, extract toxicity-related information, and form a dataset for predictive models.

However, developing such an information extraction system faces the following challenges:

Data crawled from the Web are, in most cases, not directly processable by natural language processing toolkits. Most publications online are PDF files, a format that natural language processing programs generally do not support.

An information extraction system is complex and consists of multiple components, each of which implements a specific text processing function. Common components are a tokenizer, a sentence splitter, and a part-of-speech tagger, along with more advanced components such as named entity recognition and attribute extraction. Components have dependencies; for example, entity recognition may require that several preprocessing tasks, such as tokenizing, part-of-speech tagging, and parsing, have already been completed.
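The dependency among components can be illustrated with a minimal sketch. It is not the system described in this paper: the `Component` class, the `run_pipeline` helper, the toy regular-expression tokenizer, and the two-word lexicon standing in for ontology-based entity recognition are all hypothetical names and simplifications introduced only for illustration. Each component declares which annotation layers it requires, and the runner refuses to execute a component whose prerequisites are missing.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Component:
    name: str
    requires: Set[str]             # annotation layers that must already exist
    provides: str                  # annotation layer this component adds
    run: Callable[[dict], object]  # computes the new layer from the document


def tokenize(doc: dict) -> List[str]:
    # Naive tokenizer: words (optionally hyphenated) and single punctuation marks.
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", doc["text"])


def split_sentences(doc: dict) -> List[List[str]]:
    # Group tokens into sentences, closing a sentence at '.', '!' or '?'.
    sentences, current = [], []
    for tok in doc["tokens"]:
        current.append(tok)
        if tok in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def match_entities(doc: dict) -> List[tuple]:
    # Assumption: a tiny hard-coded lexicon stands in for an ontology lookup.
    lexicon = {"tio2", "nanoparticles"}
    return [(i, tok) for i, sent in enumerate(doc["sentences"])
            for tok in sent if tok.lower() in lexicon]


def run_pipeline(text: str, components: List[Component]) -> dict:
    doc = {"text": text}
    for c in components:
        missing = c.requires - set(doc)
        if missing:
            raise RuntimeError(f"{c.name} needs {sorted(missing)} first")
        doc[c.provides] = c.run(doc)
    return doc


PIPELINE = [
    Component("tokenizer", {"text"}, "tokens", tokenize),
    Component("sentence splitter", {"tokens"}, "sentences", split_sentences),
    Component("entity recognizer", {"tokens", "sentences"}, "entities", match_entities),
]

doc = run_pipeline("TiO2 nanoparticles were toxic. The nanoparticles measured 25 nm.", PIPELINE)
print(doc["entities"])
```

Running the entity recognizer alone, for example `run_pipeline("some text", [PIPELINE[2]])`, raises a `RuntimeError` because the `tokens` and `sentences` layers have not been produced yet, which is the kind of ordering constraint described above.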
Extracting relations among entities and attributes is also difficult, especially when more than two entities or attributes are involved. Since we are dealing with full-text publications, relations that span sentences, or even paragraphs, are also desirable targets during information extraction.

To address these challenges, we designed and developed an information extraction system capable of performing the following five tasks: 1) data transformation into a machine-processable format, 2) data preprocessing, 3) named entity recognition, 4) attribute extraction, and 5) relation extraction.

In this paper, we first present previous work related to our task. Second, we describe the information extraction system we developed for nanotoxicity related publications. Then we present the results of applying this information extraction system to our dataset. Finally, we conclude the paper and discuss future research.

II. RELATED WORK

Several general categories of approaches can be applied to nanomaterial information extraction, including rule-based [4][5], ontology-based [7][8], and machine-learning-based [9]-[13] approaches.

A. Rule-Based Approaches

Rule-based information extraction relies on handcrafted patterns and pattern matching. It requires a lot of manual work, and the resulting rules are usually not reusable [4]. However, it is still a useful approach, especially when extracting complicated, structured data in areas like

2013 IEEE International Conference on Bioinformatics and Biomedicine
978-1-4799-1310-7/13/$31.00 ©2013 IEEE