Am J Pharmacogenomics 2004; 4 (4): 247-252
DATABASES AND GENOME MAPS
1175-2203/04/0004-0247/$31.00/0
© 2004 Adis Data Information BV. All rights reserved.

The Impact of Structural Genomics on the Protein Data Bank

Helen M. Berman and John D. Westbrook
Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, Piscataway, New Jersey, USA

Contents
Abstract
1. The Protein Data Bank
2. The Data Pipeline
3. Capturing the Data
4. Target Tracking
5. Conclusion

Abstract
The advent of structural genomics presents new challenges to the archive of biomacromolecular structures, the Protein Data Bank (PDB). As technologies involved in structure determination have advanced, both the number and size of structures available in the PDB have increased rapidly. The structural genomics initiatives are creating a large amount of data that needs to be tracked, archived, and made easily available. The PDB has developed tools to facilitate the rapid deposition of data produced by the structural genomics initiatives and has created databases to track the progress of the work.

A new era has emerged in biology called structural genomics. It represents the application of commonly used structural biology methods to biological macromolecules on a genomic scale. To understand how this new “omics” evolved, we must first step back a little to understand the evolution of structural biology.

The first crystal protein structure was determined in 1957. The structure of the oxygen-carrying molecule, myoglobin, was the culmination of years of effort by scientists led by John Kendrew in Cambridge, England. [1,2] This structure determination was followed shortly thereafter by the determination of another oxygen carrier structure, hemoglobin, by Max Perutz. [3,4] By the late 1960s, there were perhaps 10 crystal structures published in the scientific literature. In 1971, a community database named the ‘Protein Data Bank’ (PDB) was founded to collect and distribute the structures of biological macromolecules. [5]

The progress of data acquisition was initially quite slow, as only a few new structures were determined each year. Then in the late 1980s things began to change as the number of structures increased (figure 1) and there was greater demand for these structures. The reasons for this increase were twofold: (i) technology improved, making it possible to solve a structure in days or weeks rather than years; and (ii) the public’s attitude toward data sharing evolved, so that the norm was to make the coordinates and primary data publicly available at the time of publication.

While structure determination technology improved, the Human Genome and model organism projects produced the sequences of over 1000 genomes. [7] The analyses of these sequences have already produced invaluable information about genes, the proteins they express, and their possible functions. The next logical step beyond these genome projects is to determine the structure of these proteins on a genomic scale: structural genomics. [8]

There are 23 000 structures in the PDB, of which approximately 3000 have unique sequences, and even fewer have unique folds. Although there is no agreement as to how many proteins are expressed in even the smallest organism, it is probably safe to say that there are ten times more than are currently residing in the PDB.