Chemical documents: machine understanding and automated information extraction†

Joe A. Townsend, Sam E. Adams, Christopher A. Waudby, Vanessa K. de Souza, Jonathan M. Goodman* and Peter Murray-Rust

Unilever Centre for Molecular Science Informatics, Department of Chemistry, Lensfield Road, Cambridge, UK CB2 1EW. E-mail: pm286@cam.ac.uk; Fax: +44 1223 763076; Tel: +44 1223 763069

Org. Biomol. Chem., 2004, 2, 3294–3300. DOI: 10.1039/b411033a. www.rsc.org/obc. This journal is © The Royal Society of Chemistry 2004.

Received 21st July 2004, Accepted 8th October 2004
First published as an Advance Article on the web 20th October 2004

Automatically extracting chemical information from documents is a challenging task, but an essential one for dealing with the vast quantity of data that is available. The task is least difficult for structured documents, such as chemistry department web pages or the output of computational chemistry programs, but requires increasingly sophisticated approaches for less structured documents, such as chemical papers. The identification of key units of information, such as chemical names, makes the extraction of useful information from unstructured documents possible.

Introduction

Scientific information is global and many disciplines now recognise the need to make their data widely and freely accessible. The development of the Semantic Web¹ and the Grid (typified by the UK eScience program) is based on instant access to raw and processed scientific data. In biosciences, for example, web-based databases are commonplace and in many cases are seen as the first place for finding and re-using information. Examples are Ensembl,² the Protein Data Bank³ and SwissProt,⁴ all of which contain highly structured data with varying degrees of curation and annotation.
These data are “machine-understandable”: a computer can not only read the characters (“machine-readable”) but also has semantics and metadata which allow it to take autonomous actions such as aligning sequences or discovering binding sites. Much modern bioscience (“systems biology”) is multidisciplinary and relies on integrating data from different disciplines. Many of these have less structured data, and the cost of abstracting this in a traditional human manner is often too large. There is, therefore, a considerable effort in machine extraction of information from the primary literature and other related sources. It is noticeable that biosciences have a great need for structured chemical information and this is not currently available in open, machine-accessible, and understandable form.

This article highlights the need for machine-based information extraction in chemistry. We distinguish information retrieval (the process of identifying a document or subdocument by its associated concepts) from information extraction (obtaining structured information from the document). Information extraction can be used for many purposes:

• populating a structured database (e.g. of chemical names and connection tables)
• compiling a lexicon or dictionary of commonly used terms (e.g. solvents)
• building an ontology for semantic processing and machine reasoning, for example by the OWL language⁵
• data-mining (building predictive models from data)
• re-input into computational chemistry programs
• proof-checking (e.g. for self-consistent data).

We note that chemical information is micro-published and that few if any chemical projects publish collections of structured data. Nor, apart from chemical and protein crystallography,⁶ is there any standard method of publishing structured chemical information either in primary or secondary publications.

Chemical information is available in a huge quantity and diverse quality.
The search for particular data may begin with an index or a database, but ultimately it is necessary to read the papers themselves in order to be sure that the right information is available. If it were possible to set a computer to read the literature on our behalf, this major task could be removed, or at least reduced. Machine understanding of this vast resource of data is not currently possible.

Chemistry is one of the most fruitful disciplines for information extraction as there is considerably more de facto uniformity than in other disciplines:

• concepts are very well understood (many have survived for over 100 years)
• terms are often well formalised (e.g. through IUPAC)
• many articles are, by convention, highly structured and relatively homogeneous between publishers
• in some areas (e.g. chemical diagrams) the number of tools in common use is small, so there is a de facto uniformity of approach
• much chemistry occurs in regulated processes (patents, drug regulatory) which require highly structured documents
• much information is computer-generated or mediated (computational chemistry, spectra, etc.).

Information extraction uses many aspects of document structure and content. Simple and important examples are the commonly used words and phrases (entities) that identify instances of essential concepts. In chemistry these include:

• bibliographic components (authors, journals)
• molecular identity (name, connection table, synonym)
• properties (units, physical properties, colours, form/nature)
• procedures (solvents, amounts, colours, reagents, techniques)
• instruments (manufacturer, specification)

These can be used to give context or to classify documents or subdocuments.

It is possible to automatically extract information from chemical papers, cross check it and assemble it into searchable databases. This is only possible because the chemical literature has a reasonably rigid structure, which is centred on molecules.
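
As an illustration of the entity-based extraction described above, the following minimal Python sketch tags solvents and amounts in a single sentence of experimental text. It is not the method used in this article: the lexicon `SOLVENTS`, the pattern `QUANTITY` and the function `extract_entities` are hypothetical names introduced here, and the lexicon and unit list are deliberately tiny assumptions.

```python
import re

# Hypothetical mini-lexicon of solvent names (an assumption for this sketch);
# a real lexicon would be compiled from a large body of documents.
SOLVENTS = {"dichloromethane", "ethanol", "toluene", "diethyl ether"}

# Assumed pattern for simple amounts: a number followed by a common unit.
QUANTITY = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|g|mL|mmol|mol)\b")


def extract_entities(text):
    """Return the solvents and (value, unit) amounts found in text."""
    lowered = text.lower()
    # Lexicon lookup: keep each known solvent that occurs as a whole word.
    solvents = sorted(
        name for name in SOLVENTS
        if re.search(r"\b" + re.escape(name) + r"\b", lowered)
    )
    # Pattern matching: convert each matched number to a float.
    quantities = [(float(value), unit) for value, unit in QUANTITY.findall(text)]
    return {"solvents": solvents, "quantities": quantities}


sentence = "The aldehyde (1.5 mmol) was dissolved in dichloromethane (10 mL) and stirred."
record = extract_entities(sentence)
print(record)
```

The result is a small structured record (one solvent and two amounts for the sentence shown) of the kind that could populate a database or be cross-checked against other data in the same paper. Lexicon lookup and pattern matching are only the simplest of the possible approaches; chemical names in particular need more than a fixed word list.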
† This is one of a number of contributions on the theme of molecular informatics, published to coincide with the RSC Symposium “New Horizons in Molecular Informatics”, December 7th 2004, Cambridge UK.

Results

Web pages

Most university chemistry departments maintain web pages with information about their staff and their activities. This is a