REVIEWS
TIBS 21 - JULY 1996
Macromolecular structure
information and databases
The EU BRIDGE DataBase Project Consortium*
The current status and future outlook of macromolecular structure data-
bases and information handling, with particular reference to European data-
bases, are reviewed. Issues concerning the efficiency with which data are
represented, validated, archived and accessed are discussed in view of the
fast-growing body of information on structures of biological macromolecules.
PROGRESS IN GENETIC engineering,
X-ray crystallography and Nuclear
Magnetic Resonance (NMR) spectro-
scopy, and the advent of cheap and
powerful computers, have brought
about an exponential growth of data on
macromolecular three-dimensional struc-
tures. Although the number of protein
structures known today (approximately
3500) is still well below that of protein
sequences (over 50 000), its growth rate
parallels that observed for protein
sequences several years ago and continues
to rise, with an expected 30 000
known structures by the turn of this
century. The expected number of nu-
cleic acid structures is smaller, but will
also rise steadily.
Managing the information on bio-
macromolecule sequences and struc-
tures has become one of the major chal-
lenges of modern molecular biology.
This is particularly difficult in the case
of protein three-dimensional structures,
because it is necessary to handle a
wealth of complex information, some
of which must be derived (or calcu-
lated) from the atomic coordinates by
applying specialized programs (Box 1).
Furthermore, the data must be validated,
i.e. the quality and the consistency of
the data must be assessed. To promote
scientific progress, structural infor-
mation must also be integrated with
other data, such as those on function,
gene structure and phylogeny, which
are obtained from other sources or
stored in different databanks. In addi-
tion, the data must be easily and widely
accessible to the molecular biology
community.
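As an illustration of what deriving data from atomic coordinates involves, the following minimal sketch computes a dihedral angle from four points. It is not taken from any program used by the Consortium: the function names are hypothetical and the coordinates are made up rather than read from a databank entry. For a backbone phi angle, the four points would be the C atom of the preceding residue followed by the N, CA and C atoms of the residue in question.

```python
import math

def _sub(a, b):
    """Component-wise difference a - b of two 3-vectors."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    """Scalar product of two 3-vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    """Vector product a x b of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle, in degrees, defined by four points.

    The angle is 0 for a cis arrangement and +/-180 for trans. It is
    positive when, looking from p1 towards p2, the bond p1-p0 must be
    rotated clockwise to eclipse the bond p2-p3.
    """
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)          # normal to the plane p0, p1, p2
    n2 = _cross(b2, b3)          # normal to the plane p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Made-up coordinates (not from a real structure): the two outer bonds
# are perpendicular when viewed along the central bond.
angle = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
print(round(angle, 1))
```

The same function applies to side chain angles once the appropriate four atoms are selected; reading the atoms from a coordinate file is left out of this sketch.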
Meeting this challenge requires that the ways of storing, cross-referencing
and accessing information be enhanced, which in turn requires that the data
be organized in ways that can describe items such as atoms, residues,
secondary structure elements, folding motifs, domains, subunits and crystal
contacts, as well as the complex relationships between them.
Here, we give an overview of the recent advances in the area of
macromolecular structure databases, with illustrations taken from work by
the European BRIDGE DataBase Project Consortium.
Current status of macromolecular structure resources
Publicly available resources for storing, retrieving and manipulating
information on macromolecular structures can be divided into the archives
for the primary data (obtained from the authors or from the literature) and
the 'derived data' resources (Box 1). Major resources are listed in Table I
[a World Wide Web (WWW) version of this table can be found at
http://www.ucmb.ulb.ac.be/StructResources].
The primary data are archived at the Protein Structure Databank (PDB) at
Brookhaven National Laboratories (BNL)1 and the Nucleic Acid Database
(NDB)2 and are freely available to all. Many countries and laboratories
also keep a mirrored copy of the PDB to improve local access.
The 'derived data' resources can be grouped according to the mode used to
access them: some data are stored in ASCII files, available by file
transfer; others can be 'browsed' on the WWW using tools such as Mosaic™ or
Netscape™, which allow limited navigation through 'pre-canned' data or
reports; the most sophisticated access is available when the data are
organized in a database (under management systems such as ORACLE® and
SYBASE®). For data retrieval, databases have the advantage of allowing
custom-designed queries to be made without writing specialized programs.
An important consideration for efficient access to information is the
availability of links between different data resources. Currently in PDB
files, no
*Peter M. D. Gray and Graham J. L. Kemp are at the Computing Science
Department, University of Aberdeen, Aberdeen, UK AB9 2UE.
Christopher J. Rawlings and Nigel P. Brown are at the ICRF, 44 Lincoln's
Inn Fields, London, UK WC2A 3PX.
Christian Sander is at the EMBL, 69012 Heidelberg, Meyerhofstraße 1,
Germany.
Janet M. Thornton and Christine M. Orengo are at the Biochemistry and
Molecular Biology Department, University College London, London, UK
WC1E 6BT.
Shoshana J. Wodak and Jean Richelle are at the UCMB, Université Libre de
Bruxelles, av. F. D. Roosevelt 50 - CP160/16, 1050 Bruxelles, Belgium.
© 1996, Elsevier Science Ltd
PII: S0968-0004(96)10037-2
Box 1. Derived data
Computed data from a single three-dimensional
structure
• Backbone and side chain dihedral angles.
• Atom connectivities.
• H-bonding partners.
• Secondary structure assignments.
• Topological description of β-sheets.
• Topological and geometric description of the relative
arrangement of all secondary structures.
• Location of specific structure motifs: specific types of
turns, supersecondary structures.
• Solvent accessible surface areas of atoms, residues,
whole subunits and multi-subunit complexes.
• Volumes of atoms, residues, whole subunits and multi-subunit
complexes.
• Locations of empty and water filled cavities; their
volume, surface area.
• Residue-residue contact maps.
• Limits of structural domains.
• List of residues involved in subunit contacts.
• List of residues involved in ligand binding.
• List of residues involved in crystal contacts.
• Pointers to homologous structures.
• Pointers to homologous sequences in other databanks.
• Annotation of structure quality assessment.
Data computed from surveys of many structures
• Preferred side chain rotamers.
• Preferred modes of protein side chain interactions.
• Preferred modes of protein-water interactions.
• Repertoire of turn motifs and their characteristics.
• Repertoire of sheet topologies.
• Repertoire of helix-helix, helix-sheet and sheet-sheet
interactions.
• Repertoire of folding motifs.
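Derived data of the kind listed in Box 1 lend themselves to storage in a relational database, where a custom-designed query can replace a specialized program. The sketch below is only an illustration of that idea under stated assumptions: it uses SQLite through Python's standard library as a stand-in for a management system such as ORACLE or SYBASE, and the table layout, the function name `buried_helix_counts`, the structure identifiers and all values are hypothetical and made up for the example.

```python
import sqlite3

# Hypothetical schema: one row per residue, holding two derived
# properties from Box 1 (secondary structure, accessible surface area).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE residue (
        structure_id  TEXT,
        chain         TEXT,
        resnum        INTEGER,
        resname       TEXT,
        sec_struct    TEXT,   -- 'H' helix, 'E' strand, 'C' other
        accessibility REAL    -- solvent accessible area, square angstroms
    )
    """
)

# Made-up rows for illustration; not taken from any real structure.
rows = [
    ("demo1", "A", 1, "ALA", "H", 5.0),
    ("demo1", "A", 2, "LEU", "H", 2.5),
    ("demo1", "A", 3, "LYS", "H", 120.0),
    ("demo1", "A", 4, "VAL", "E", 8.0),
    ("demo2", "A", 1, "LEU", "H", 10.0),
    ("demo2", "A", 2, "GLY", "C", 60.0),
]
conn.executemany("INSERT INTO residue VALUES (?, ?, ?, ?, ?, ?)", rows)

def buried_helix_counts(connection, max_area=20.0):
    """Count helical residues of each type with area below max_area."""
    cursor = connection.execute(
        "SELECT resname, COUNT(*) FROM residue "
        "WHERE sec_struct = 'H' AND accessibility < ? "
        "GROUP BY resname ORDER BY resname",
        (max_area,),
    )
    return cursor.fetchall()

result = buried_helix_counts(conn)
print(result)
```

Changing the question, for example to exposed residues in strands, only requires editing the WHERE clause, which is the practical advantage of a database over fixed reports or flat files.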