DOI:10.1039/b411033a
This journal is © The Royal Society of Chemistry 2004. Org. Biomol. Chem., 2004, 2, 3294–3300
Chemical documents: machine understanding and automated
information extraction†
Joe A. Townsend, Sam E. Adams, Christopher A. Waudby, Vanessa K. de Souza,
Jonathan M. Goodman* and Peter Murray-Rust
Unilever Centre for Molecular Science Informatics, Department of Chemistry, Lensfield Road,
Cambridge, UK CB2 1EW. E-mail: pm286@cam.ac.uk; Fax: +44 1223 763076;
Tel: +44 1223 763069
Received 21st July 2004, Accepted 8th October 2004
First published as an Advance Article on the web 20th October 2004
Automatically extracting chemical information from documents is a challenging task, but an essential one for dealing
with the vast quantity of data that is available. The task is least difficult for structured documents, such as chemistry
department web pages or the output of computational chemistry programs, but requires increasingly sophisticated
approaches for less structured documents, such as chemical papers. The identification of key units of information,
such as chemical names, makes the extraction of useful information from unstructured documents possible.
Introduction
Scientific information is global and many disciplines now
recognise the need to make their data widely and freely accessible.
The development of the Semantic Web¹ and the Grid
(typified by the UK eScience program) is based on instant
access to raw and processed scientific data. In biosciences,
for example, web-based databases are commonplace and in
many cases are seen as the first place for finding and re-using
information. Examples are Ensembl,² the Protein Data Bank³
and SwissProt,⁴ all of which contain highly structured data
with varying degrees of curation and annotation. These data
are “machine-understandable”: a computer can not only read
the characters (“machine-readable”) but also has semantics and
metadata which allow it to take autonomous actions such as
aligning sequences or discovering binding sites.
Much modern bioscience (“systems biology”) is multidisciplinary
and relies on integrating data from different
disciplines. Many of these have less structured data, and the
cost of abstracting it manually is often too high.
There is, therefore, considerable effort in machine
extraction of information from the primary literature and
other related sources. The biosciences have
a great need for structured chemical information, and this
is not currently available in an open, machine-accessible and
understandable form.
This article highlights the need for machine-based information
extraction in chemistry. We distinguish information retrieval
(the process of identifying a document or subdocument by its
associated concepts) from information extraction (obtaining
structured information from the document). Information
extraction can be used for many purposes:
• populating a structured database (e.g. of chemical names
and connection tables)
• compiling a lexicon or dictionary of commonly used terms
(e.g. solvents)
• building an ontology for semantic processing and machine
reasoning, for example using the OWL language⁵
• data-mining (building predictive models from data)
• re-input into computational chemistry programs
• proof-checking (e.g. for self-consistent data).
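As an illustration of the proof-checking use, the following minimal Python sketch tests whether a reported molecular mass is consistent with a reported molecular formula. It is a sketch under stated assumptions, not a method described in this article: the names ATOMIC_MASS, formula_mass and is_consistent are hypothetical, the mass table covers only four elements, the tolerance is arbitrary, and bracketed formulae are not handled.

```python
import re

# Rounded standard atomic weights for a few common elements.
# Assumption: only these four elements occur; extend the table as needed.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def formula_mass(formula):
    """Return the molar mass of a simple formula such as 'C6H6'.

    Brackets and hydrates are not supported; an element missing
    from ATOMIC_MASS raises KeyError.
    """
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return total


def is_consistent(formula, reported_mass, tolerance=0.05):
    """True if the reported mass agrees with the formula within tolerance."""
    return abs(formula_mass(formula) - reported_mass) <= tolerance


# A correctly reported value passes; a transposed-digit typo is flagged.
print(is_consistent("C6H6", 78.11))
print(is_consistent("C6H6", 87.11))
```

A check of this kind only becomes possible once the formula and the mass have been extracted as structured values, not left as free text.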
We note that chemical information is micro-published and that
few if any chemical projects publish collections of structured
data. Nor, apart from chemical and protein crystallography,⁶ is
there any standard method of publishing structured chemical
information either in primary or secondary publications.
Chemical information is available in huge quantities and varies widely
in quality. The search for particular data may begin with
an index or a database, but ultimately it is necessary to read the
papers themselves in order to be sure that the right information
is available. If it were possible to set a computer to read the
literature on our behalf, this major task could be removed, or at
least reduced. Machine understanding of this vast resource of
data is not currently possible.
Chemistry is one of the most fruitful disciplines for information
extraction as there is considerably more de facto
uniformity than in other disciplines:
• concepts are very well understood (many have survived for
over 100 years)
• terms are often well formalised (e.g. through IUPAC)
• many articles are, by convention, highly structured and
relatively homogeneous between publishers
• in some areas (e.g. chemical diagrams) the number of tools
in common use is small, so there is a de facto uniformity of
approach
• much chemistry occurs in regulated processes (patents, drug
regulatory) which require highly structured documents
• much information is computer-generated or mediated
(computational chemistry, spectra, etc.).
Information extraction uses many aspects of document
structure and content. Simple and important examples are
the commonly used words and phrases (entities) that identify
instances of essential concepts. In chemistry these include:
• bibliographic components (authors, journals)
• molecular identity (name, connection table, synonym)
• properties (units, physical properties, colours, form/nature)
• procedures (solvents, amounts, colours, reagents, techniques)
• instruments (manufacturer, specification)
These can be used to give context or to classify documents
or subdocuments.
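One simple way to identify such entities is to match text against small lexicons. The Python sketch below illustrates the idea only. The LEXICON contents, the labels and the function names are hypothetical choices made for this example, and a practical system would need much larger curated term lists and would have to handle inflected forms.

```python
import re

# Hypothetical mini-lexicons, one per entity class (illustrative only).
LEXICON = {
    "solvent": ["dichloromethane", "tetrahydrofuran", "methanol", "diethyl ether"],
    "technique": ["column chromatography", "recrystallisation", "distillation"],
    "colour": ["colourless", "yellow", "white"],
}


def build_pattern(terms):
    """Compile a case-insensitive whole-word pattern for a list of terms."""
    # Longest terms first, so multi-word terms win over shorter alternatives.
    ordered = sorted(terms, key=len, reverse=True)
    body = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(" + body + r")\b", re.IGNORECASE)


PATTERNS = {label: build_pattern(terms) for label, terms in LEXICON.items()}


def tag_entities(text):
    """Return (start, end, label, matched_text) tuples sorted by position."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), label, match.group(0)))
    return sorted(hits)


sentence = ("The residue was purified by column chromatography and "
            "recrystallised from methanol to give a colourless solid.")
for hit in tag_entities(sentence):
    print(hit)
```

In this example "recrystallised" is not tagged, because only the exact form "recrystallisation" is in the lexicon. This shows why plain dictionary matching is a starting point and not a complete solution.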
It is possible to automatically extract information from
chemical papers, cross check it and assemble it into searchable
databases. This is only possible because the chemical literature
has a reasonably rigid structure, which is centred on molecules.
Results
Web pages
Most university chemistry departments maintain web pages
with information about their staff and their activities. This is a
† This is one of a number of contributions on the theme of molecular
informatics, published to coincide with the RSC Symposium “New
Horizons in Molecular Informatics”, December 7th 2004, Cambridge
UK.