REVIEWS
TIBS 21 - JULY 1996

Macromolecular structure information and databases

The EU BRIDGE DataBase Project Consortium*

The current status and future outlook of macromolecular structure databases and information handling, with particular reference to European databases, are reviewed. Issues concerning the efficiency with which data are represented, validated, archived and accessed are discussed in view of the fast growing body of information on structures of biological macromolecules.

PROGRESS IN GENETIC engineering, X-ray crystallography and Nuclear Magnetic Resonance (NMR) spectroscopy, and the advent of cheap and powerful computers, have brought about an exponential growth of data on macromolecular three-dimensional structures. Although the number of protein structures known today (approximately 3500) is still well below that of protein sequences (over 50 000), its growth rate parallels that observed for protein sequences several years ago and continues to rise, with an expected 30 000 known structures by the turn of this century. The expected number of nucleic acid structures is smaller, but will also rise steadily.

Managing the information on biomacromolecule sequences and structures has become one of the major challenges of modern molecular biology. This is particularly difficult in the case of protein three-dimensional structures, because it is necessary to handle a wealth of complex information, some of which must be derived (or calculated) from the atomic coordinates by applying specialized programs (Box 1). Furthermore, the data must be validated, i.e. the quality and the consistency of the data must be assessed. To promote scientific progress, structural information must also be integrated with other data, such as that on function, gene structure and phylogeny, which are obtained from other sources or stored in different databanks. In addition, the data must be easily and widely accessible to the molecular biology community.
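As a hedged illustration of what "derived (or calculated) from the atomic coordinates" involves, the following minimal sketch computes a dihedral angle, one of the derived quantities listed in Box 1, from four atom positions. The function name `dihedral` and the example coordinates are assumptions made for illustration; this is not one of the specialized programs referred to in this review.

```python
import math


def _sub(a, b):
    """Component-wise difference a - b of two 3D points."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    """Dot product of two 3D vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    """Cross product of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond.

    For a backbone phi angle the four points would be the
    C(i-1), N(i), CA(i) and C(i) atom positions.
    """
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal to the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal to the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))
```

For example, `dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))` evaluates to 90 degrees, while a planar cis arrangement gives 0 degrees.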
Meeting this challenge requires that the ways of storing, cross-referencing and accessing information must be enhanced, which in turn requires that the data are organized in ways that can describe items such as atoms, residues, secondary structure elements, folding motifs, domains, subunits and crystal contacts, as well as the complex relationships between them. Here, we give an overview of the recent advances in the area of macromolecular structure databases, with illustrations taken from works by the European BRIDGE DataBase Project Consortium.

Current status of macromolecular structure resources

Publicly available resources for storing, retrieving and manipulating information on macromolecular structures can be divided into the archives for the primary data (obtained from the authors or in the literature) and the 'derived data' resources (Box 1). Major resources are listed in Table I [a World Wide Web (WWW) version of this table can be found at http://www.ucmb.ulb.ac.be/StructResources]. The primary data are archived at the Protein Structure Databank (PDB) at Brookhaven National Laboratories (BNL)¹ and the Nucleic Acid Database (NDB)² and are freely available to all. Many countries and laboratories also keep a mirrored copy of the PDB to improve local access.

The 'derived data' resources can be grouped according to the mode used to access them: some data are stored in ASCII files, available by file transfer; others can be 'browsed' on the WWW using tools such as Mosaic™ or Netscape™, which allow limited navigation through 'pre-canned' data or reports; the most sophisticated access is available when the data are organized in a database (under management systems such as ORACLE® and SYBASE®). For data retrieval, databases have the advantage of allowing for custom-designed queries to be made without writing specialized programs.

An important consideration for efficient access to information is the availability of links between different data resources. Currently in PDB files, no

*Peter M. D. Gray and Graham J. L. Kemp are at the Computing Science Department, University of Aberdeen, Aberdeen, UK AB9 2UE. Christopher J. Rawlings and Nigel P. Brown are at the ICRF, 44 Lincoln's Inn Fields, London, UK WC2A 3PX. Christian Sander is at the EMBL, 69012 Heidelberg, Meyerhofstraße 1, Germany. Janet M. Thornton and Christine M. Orengo are at the Biochemistry and Molecular Biology Department, University College London, London, UK WC1E 6BT. Shoshana J. Wodak and Jean Richelle are at the UCMB, Université Libre de Bruxelles, av. F. D. Roosevelt 50 - CP160/16, 1050 Bruxelles, Belgium.

© 1996, Elsevier Science Ltd
PII: S0968-0004(96)10037-2

Box 1. Derived data

Computed data from a single three-dimensional structure
• Backbone and side chain dihedral angles.
• Atom connectivities.
• H-bonding partners.
• Secondary structure assignments.
• Topological description of β-sheets.
• Topological and geometric description of the relative arrangement of all secondary structures.
• Location of specific structure motifs: specific types of turns, supersecondary structures.
• Solvent accessible surface areas of atoms, residues, whole subunits and multi-subunit complexes.
• Volumes of atoms, residues, whole subunits and multi-subunit complexes.
• Locations of empty and water filled cavities; their volume, surface area.
• Residue-residue contact maps.
• Limits of structural domains.
• List of residues involved in subunit contacts.
• List of residues involved in ligand binding.
• List of residues involved in crystal contacts.
• Pointers to homologous structures.
• Pointers to homologous sequences in other databanks.
• Annotation of structure quality assessment.

Data computed from surveys of many structures
• Preferred side chain rotamers.
• Preferred modes of protein sidechain interactions.
• Preferred modes of protein-water interactions.
• Repertoire of turn motifs and their characteristics.
• Repertoire of sheet topologies.
• Repertoire of helix-helix, helix-sheet and sheet-sheet interactions.
• Repertoire of folding motifs.
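
The main text notes that organizing data in a database allows custom-designed queries without writing specialized programs. The sketch below illustrates that idea for a few of the derived items in Box 1, using Python's built-in `sqlite3` module as a stand-in for a database management system. The table names, column names and all data values are hypothetical and invented for this example; they do not reproduce the schema of any resource described in this review.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database holding invented per-residue derived data."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE residue (
            id         INTEGER PRIMARY KEY,
            chain      TEXT,
            seq_num    INTEGER,
            name       TEXT,
            sec_struct TEXT,  -- 'H' helix, 'E' strand, 'C' coil
            rel_access REAL   -- relative solvent accessibility, 0 to 1
        );
        CREATE TABLE contact (
            res_a INTEGER REFERENCES residue(id),
            res_b INTEGER REFERENCES residue(id),
            kind  TEXT        -- 'subunit', 'ligand' or 'crystal'
        );
    """)
    # Toy values, invented purely for illustration.
    con.executemany(
        "INSERT INTO residue VALUES (?, ?, ?, ?, ?, ?)",
        [
            (1, "A", 10, "LEU", "H", 0.05),
            (2, "A", 11, "LYS", "H", 0.60),
            (3, "A", 25, "VAL", "E", 0.10),
            (4, "B", 10, "ILE", "H", 0.15),
            (5, "B", 40, "GLY", "C", 0.80),
        ],
    )
    con.executemany(
        "INSERT INTO contact VALUES (?, ?, ?)",
        [(1, 4, "subunit"), (2, 5, "subunit"), (3, 5, "crystal")],
    )
    return con


def buried_helix_residues_at_interface(con, max_access=0.2):
    """Return helical, buried residues that take part in subunit contacts."""
    query = """
        SELECT DISTINCT r.chain, r.seq_num, r.name
        FROM residue AS r
        JOIN contact AS c ON r.id IN (c.res_a, c.res_b)
        WHERE r.sec_struct = 'H'
          AND r.rel_access <= ?
          AND c.kind = 'subunit'
        ORDER BY r.chain, r.seq_num
    """
    return con.execute(query, (max_access,)).fetchall()
```

With the toy data above, `buried_helix_residues_at_interface(build_demo_db())` returns `[('A', 10, 'LEU'), ('B', 10, 'ILE')]`. Changing the question, for example to strand residues in crystal contacts, only requires editing the query text, which is the advantage over flat ASCII files that the text describes.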