Protein Engineering vol.6 no.4 pp.391-395, 1993
The SBASE domain library: a collection of annotated protein segments
Sandor Pongor 1,2, Vesna Skerl 1,4, Miklos Cserzo 1,5, Zsolt Hatsagi 2,3, Gyorgy Simon 1 and Valeria Bevilacqua 1
1 International Centre for Genetic Engineering and Biotechnology, Area Science Park, 34012 Trieste, Italy, 2 ABC Institute for Biochemistry and Protein Research, 2100 Godollo, Hungary and 3 Department of Computer Sciences, The University of Chicago, Chicago IL, 60637, USA
4 Permanent address: Institute for Nuclear Sciences-Vinca, PO Box 522, Belgrade, Yugoslavia
5 Permanent address: Biological Research Center, Hungarian Academy of Sciences, 1518 Budapest 7, Hungary
SBASE is a database of annotated protein domain sequences representing various structural, functional, ligand binding and topogenic segments of proteins. The current release of SBASE contains 27 211 entries, which are provided with standardized names in order to facilitate retrieval. SBASE is cross-referenced to the major protein and nucleic acid databanks as well as to the PROSITE catalog of protein sequence patterns [Bairoch,A. (1992) Nucleic Acids Res., 20, Suppl., 2013-2118]. SBASE can be used to establish domain homologies through database search using programs such as FASTA [Lipman and Pearson (1985) Science, 227, 1436-1441], FASTDB [Brutlag et al. (1990) Comp. Appl. Biosci., 6, 237-245] or BLAST3 [Altschul and Lipman (1990) Proc. Natl. Acad. Sci. USA, 87, 5509-5513]. This is especially useful in the case of loosely defined domain types for which efficient consensus patterns cannot be established. The use of SBASE is illustrated on the DNA binding protein Brain-4. The database and a set of search and retrieval tools are freely available on request to the authors or by anonymous 'ftp' file transfer from <ftp.icgeb.trieste.it>.
Key words: distant homologies/domains/protein sequence patterns
Introduction
Database search is probably the most widely used approach to predict the biological function of a newly determined protein sequence. If two large proteins share only a few common domains, detection of homology often becomes problematic. First, random identities present in the alignment may 'mask' the biologically important sequence patterns (Barker et al., 1988; Baron et al., 1991). Second, search programs such as FASTA (Lipman and Pearson, 1985), FASTDB (Brutlag et al., 1990) or BLAST3 (Altschul and Lipman, 1990) give only the best homology regions between the query and a database entry, so weaker homologies between the query and the same entry can remain undetected.
SBASE is a comprehensive collection of over 27 000 annotated protein sequence segments consistently named by structure, function, biased composition, binding specificity and/or similarity to other proteins. SBASE can be considered as a conversion of the protein sequence database into a format that facilitates detection of functional and structural similarities rather than sequence homologies. Searching this database with FASTA or FASTDB yields information on the potential functions of the detected homology regions, which may allow, or at least facilitate, the detection of domain homologies and the prediction of function.
Description of the database
Sources of domain sequence data
Sequence data in SBASE originate from three different sources: (i) from the SWISS-PROT protein sequence databank (Bairoch and Boeckmann, 1992); (ii) from the Protein Sequence Database of the Protein Identification Resource (PIR) (Barker et al., 1992); and (iii) from the literature. Entries of the last kind are either keyed in manually or extracted and translated from the EMBL (Higgins et al., 1992) or GenBank (Burks et al., 1992) nucleotide sequence databases.
Definitions of domains
Domains included in SBASE are sequence segments with known structure and/or function. Several main types of domains are defined as follows. (i) Structural domains are sequence segments with a known structure [like the protein modules (Bork, 1992), e.g. epidermal growth factor-like (EGF-like) and immunoglobulin-like (IG-like) domains], segments of biased composition (e.g. serine/threonine-rich domains), as well as various sequence repeats. (ii) Homology domains are regions of homology to other proteins detected by the original authors. These homology regions are less well characterized than the 'established' structural domains but can eventually be used to define further domain types. (iii) Ligand binding domains are sequence segments known to bind specific ligands (such as DNA, metals, sugars, etc.). (iv) Cellular location domains are sequence segments known to be involved in targeting (signal peptides, nuclear-localization signals, chloroplast transit peptides), as well as domains of transmembrane proteins (cytoplasmic, transmembrane and extracellular domains).
Redundancy of sequences in SBASE is kept to a minimum. In some cases, however, domains are defined in an overlapping fashion. For example, an extracellular domain (cellular location domain) may contain epidermal growth factor type repeats which are also represented as individual entries.
In order to facilitate retrieval of sequences belonging to the same domain type, SBASE domain names are standardized. Examples of these standardized names/keywords are listed in Table I. Table II lists the main domain types included in SBASE. Each of the main domain types is represented by domain groups, and the number of corresponding entries in SBASE is given in the table. Domain boundaries are used as defined by the original authors or are determined by similarity to domains with defined boundaries. The statistical distribution of sequence lengths in SBASE is summarized in the histogram in Figure 1.
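The overlapping definition of domains described above can be made concrete with a small sketch. The Python fragment below is an illustration only: the record fields, the protein name EXAMPLE_RECEPTOR and all coordinates are hypothetical and are not taken from SBASE. It models the four main domain types and reports which entries lie entirely inside another entry of the same protein, as in the case of EGF type repeats within an extracellular domain.

```python
from dataclasses import dataclass
from enum import Enum


class DomainType(Enum):
    """The four main domain types named in the text."""
    STRUCTURAL = "structural"
    HOMOLOGY = "homology"
    LIGAND_BINDING = "ligand binding"
    CELLULAR_LOCATION = "cellular location"


@dataclass(frozen=True)
class DomainEntry:
    # Hypothetical record layout; it does not mirror the real SBASE entry format.
    protein: str
    name: str
    domain_type: DomainType
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def nested_pairs(entries):
    """Return (outer, inner) pairs where inner lies entirely within outer
    on the same protein."""
    pairs = []
    for outer in entries:
        for inner in entries:
            if outer is inner or outer.protein != inner.protein:
                continue
            if outer.start <= inner.start and inner.end <= outer.end:
                pairs.append((outer, inner))
    return pairs


# Invented example data: one extracellular domain containing two EGF-like repeats.
entries = [
    DomainEntry("EXAMPLE_RECEPTOR", "EXTRACELLULAR", DomainType.CELLULAR_LOCATION, 1, 300),
    DomainEntry("EXAMPLE_RECEPTOR", "EGF-LIKE", DomainType.STRUCTURAL, 50, 90),
    DomainEntry("EXAMPLE_RECEPTOR", "EGF-LIKE", DomainType.STRUCTURAL, 95, 135),
    DomainEntry("EXAMPLE_RECEPTOR", "TRANSMEMBRANE", DomainType.CELLULAR_LOCATION, 301, 323),
]

for outer, inner in nested_pairs(entries):
    print(f"{inner.name} {inner.start}-{inner.end} lies inside "
          f"{outer.name} {outer.start}-{outer.end}")
```

In this toy data set both EGF-like repeats are reported as lying inside the extracellular domain, while the transmembrane segment is not nested in anything. The `length` property gives the quantity that a length histogram such as the one in Figure 1 would be built from.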
Format
The SBASE domain library is composed of domain sequence entries. Each entry is composed of lines, with a format similar to that used by the EMBL and SWISS-PROT databases. A sample entry is shown in Figure 2. The name of the domain is contained in the DE (definition) lines and is also partly repeated in the ID