Journal of Biotechnology 78 (2000) 221–234
www.elsevier.com/locate/jbiotec

The role SWISS-PROT and TrEMBL play in the genome research environment

Vivien Junker *, Sergio Contrino, Wolfgang Fleischmann, Henning Hermjakob, Fiona Lang, Michele Magrane, Maria Jesus Martin, Nicoletta Mitaritonna, Claire O'Donovan, Rolf Apweiler

EMBL Outstation, The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK

* Corresponding author. Fax: +44-1223-494472. E-mail address: junker@ebi.ac.uk (V. Junker)

Received 1 February 1999; accepted 5 July 1999

PII: S0168-1656(00)00198-X

Abstract

SWISS-PROT, a curated protein sequence data bank, contains not only sequence data but also annotation relevant to each sequence. Each entry is annotated by a team of biologists, primarily from journal articles reporting the actual sequencing and, sometimes, the characterisation of the protein. Review articles and collaboration with external experts also play a role, as do secondary databases such as PROSITE and Pfam and a variety of feature prediction methods. Annotation added by these methods is checked for its relevance to, and likelihood for, the particular sequence. The onset of genome sequencing has led to a dramatic increase in the sequence data to be included in SWISS-PROT. This has led to the production of TrEMBL (Translation of the EMBL database). TrEMBL consists of entries in SWISS-PROT format that are derived from the translation of all coding sequences in the EMBL nucleotide sequence database that are not in SWISS-PROT. Unlike SWISS-PROT entries, those in TrEMBL are awaiting manual annotation. However, rather than just representing basic sequence and source information, steps have been taken to add features and annotation automatically. It is hoped that these steps enhance TrEMBL entries with some indication of what a protein is, or could be. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Genome sequence data; Annotation; Automation; SWISS-PROT; TrEMBL

1. Introduction

SWISS-PROT (Bairoch and Apweiler, 1999) is a curated protein sequence data bank that strives to provide the necessary parameters of a public sequence database, namely: a high level of annotation (such as the description of the function of the protein, post-translational modifications, variants, etc.), a minimal level of redundancy, and a high level of integration with other databases. TrEMBL (Translation of EMBL) (Bairoch and Apweiler, 1999) is a computer-annotated supplement to