BIOINFORMATICS Vol. 17 no. 6 2001 Pages 533–534

Technical comment to “Database verification studies of SWISS-PROT and GenBank” by Karp et al.

Rolf Apweiler 1,*, Paul Kersey 1, Viv Junker 1 and Amos Bairoch 2

1 The EMBL Outstation - The European Bioinformatics Institute (EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK and 2 Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 rue Michel Servet, 1211 Geneva 4, Switzerland

In their paper “Database verification studies of SWISS-PROT and GenBank” Karp et al. (2001) conclude: (1) “SWISS-PROT is more incomplete than we expected...”; (2) “Even if we combine SWISS-PROT and TrEMBL, some sequences from the full genomes are missing from the combined dataset”; (3) “In many cases, translated GenBank genes do not exactly match the corresponding SWISS-PROT sequences, ...”; and (4) “... that SWISS-PROT does not identify a significant number of experimentally characterized proteins”. These results, and the approach used to arrive at these results, are in our opinion somewhat misleading. Herein, we only focus on four major points.

First, there has never been a claim that SWISS-PROT is comprehensive. Thus, it is surprising that Karp et al. found that “SWISS-PROT is more incomplete than we expected...”. To make sequences available as quickly as possible without diluting the quality of SWISS-PROT, the supplemental database TrEMBL was introduced in 1996 and contains the translation of all coding sequences (CDS) in the DDBJ/EMBL/GenBank nucleotide sequence database, except those already included in SWISS-PROT. Snapshots of the SWISS-PROT, TrEMBL and TrEMBLnew databases are released weekly, synchronised with the DDBJ/EMBL/GenBank nucleotide sequence database and provide comprehensive coverage (ftp://ftp.ebi.ac.uk/pub/databases/sp_tr_nrdb/).
The weekly comprehensive SWISS-PROT/TrEMBL nonredundant database (SPTR) has been widely publicised on the EBI and ExPASy web-servers and in various publications (e.g. Apweiler, 2000).

*To whom correspondence should be addressed. Email: apweiler@ebi.ac.uk

Second, the authors’ assertions that “Even if we combine SWISS-PROT and TrEMBL, some sequences from the full genomes are missing from the combined dataset.” and “SWISS-PROT curators apparently chose not to replace existing SWISS-PROT sequences with sequences from complete-genome projects” are rather inaccurate. Karp et al. tried to establish corresponding sets of SWISS-PROT/TrEMBL proteins and DDBJ/EMBL/GenBank coding sequence translations by sequence similarity searches between SWISS-PROT data from release 38, data from an unspecified TrEMBL release, and the data originally submitted to GenBank, which represents an outdated version of the genomic sequences. This methodology is questionable, since changes to sequence, both in SWISS-PROT and in the nucleotide sequence databases, imply that sequence identity cannot be used for tracking entries between databases. For this reason, we use the ‘Protein Sequence Identifier’ to cross-reference with coding sequences in the nucleotide sequence databases. The specific format for cross-references from SWISS-PROT or TrEMBL to CDS in the DDBJ/EMBL/GenBank nucleotide sequence database is:

    DR   EMBL; ACCESSION_NR; PROTEIN_ID; STATUS_IDENTIFIER.

For example:

    DR   EMBL; AJ000012; CAA03857.1; -.

The secondary identifier here is the ‘protein_id’, which stands for the ‘Protein Sequence Identifier’. It is a string stored in a qualifier called ‘/protein_id’, which is tagged to every CDS in the DDBJ/EMBL/GenBank nucleotide databases.
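The cross-reference format quoted above lends itself to a simple illustration. The following is only a minimal sketch, assuming the four-field layout exactly as given in the text; the names `EmblXref`, `parse_dr_embl` and `protein_ids` are hypothetical and are not part of any SWISS-PROT or EBI software.

```python
import re
from typing import Iterable, NamedTuple, Optional, Set


class EmblXref(NamedTuple):
    """One DR EMBL cross-reference (hypothetical container for this sketch)."""
    accession: str   # ACCESSION_NR of the nucleotide entry
    protein_id: str  # Protein Sequence Identifier of the CDS
    status: str      # STATUS_IDENTIFIER, '-' in the example from the text


# Assumed layout: DR EMBL; ACCESSION_NR; PROTEIN_ID; STATUS_IDENTIFIER.
_DR_EMBL = re.compile(r"^DR\s+EMBL;\s*([^;]+);\s*([^;]+);\s*(.+?)\.\s*$")


def parse_dr_embl(line: str) -> Optional[EmblXref]:
    """Return the parsed cross-reference, or None if the line is not a DR EMBL line."""
    match = _DR_EMBL.match(line)
    if match is None:
        return None
    accession, protein_id, status = (field.strip() for field in match.groups())
    return EmblXref(accession, protein_id, status)


def protein_ids(entry_lines: Iterable[str]) -> Set[str]:
    """Collect the protein identifiers found on all DR EMBL lines of one entry."""
    parsed = (parse_dr_embl(line) for line in entry_lines)
    return {xref.protein_id for xref in parsed if xref is not None}


xref = parse_dr_embl("DR EMBL; AJ000012; CAA03857.1; -.")
print(xref)
```

Matching on the identifier rather than on the sequence means that an entry can still be tracked after its sequence has been revised, which is the reason given above for preferring this approach.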
For instance, a CDS feature carrying this qualifier appears in the feature table as:

    FT   CDS             302..2674
    FT                   /protein_id="CAA03857.1"
    FT                   /db_xref="SWISS-PROT:P26345"

Use of these identifiers allows the identification of all proteins in SWISS-PROT and TrEMBL that correspond to coding sequences in a given completed genome sequence. In this way, up-to-date non-redundant protein sets are produced each week for each completed genome (Apweiler et al., 2001; http://www.ebi.ac.uk/proteome/). The reason these sets are produced weekly is that genome sequence data is frequently updated after the

© Oxford University Press 2001