Searching the Protein Structure Databank with Weak Sequence Patterns and Structural Constraints

Inge Jonassen 1, Ingvar Eidhammer 1, Svenn H. Grindhaug 1 and William R. Taylor 1,2 *

1 Department of Informatics, University of Bergen, Høyteknologisenteret (P.B. 7800), N-5020 Bergen, Norway
2 Division of Mathematical Biology, National Institute for Medical Research, The Ridgeway, Mill Hill, London NW7 1AA, UK

A method is described in which proteins that match PROSITE patterns are filtered by the root-mean-square deviation of the local 3D structures of the probe and target over the pattern components. This was found to increase the discrimination between true and false members of the protein family, but the improvement depended on how unique the structural features in the pattern were compared to equivalent fragments extracted from the structure databank (for example, if the pattern fell in an α-helix, discrimination was poor). We then generalised the sequence patterns (by widening the range of amino acid residues allowed at each position) and monitored how well the structural information helped retain specificity. While the discrimination of the pure sequence pattern had generally disappeared at information content values less than ten bits, the discrimination of the combined sequence-structure probe remained high at this point before following a similar decay. The displacement between these curves indicates that the structural component is, on average, equivalent to about ten bits. The sequence patterns were also filtered using the structure comparison program SAP, giving a global, rather than local, "view" of the proteins. This allowed the sequence patterns to become even less specific (lower in information content) but raised the problem of whether proteins that have the same fold but no PROSITE pattern should count as family members.
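The generalisation step described above (widening the residues allowed at each position) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper names are hypothetical, the example patterns are invented for illustration rather than taken from PROSITE, the `<` and `>` anchors are not handled, and the information content is a crude estimate that assumes a uniform background over the 20 amino acids, which may differ from the measure used in this work.

```python
import math
import re

N_AMINO_ACIDS = 20  # size of the standard amino acid alphabet


def parse_elements(pattern):
    """Split a PROSITE-style pattern into (core, min_repeat, max_repeat) tuples."""
    elements = []
    for token in pattern.rstrip(".").split("-"):
        # A token is a core ('C', 'x', '[LIVM]', '{P}') plus an optional '(n)' or '(n,m)'.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", token)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        lo = int(lo) if lo else 1
        hi = int(hi) if hi else lo
        elements.append((core, lo, hi))
    return elements


def to_regex(pattern):
    """Translate the pattern into an equivalent Python regular expression."""
    parts = []
    for core, lo, hi in parse_elements(pattern):
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any residue except those listed
        if (lo, hi) != (1, 1):
            core += "{%d}" % lo if lo == hi else "{%d,%d}" % (lo, hi)
        parts.append(core)
    return "".join(parts)


def information_bits(pattern):
    """Crude information content: sum of log2(20 / allowed residues) per position.

    Assumption: uniform background frequencies; gap flexibility is ignored.
    """
    total = 0.0
    for core, lo, _hi in parse_elements(pattern):
        if core == "x":
            allowed = N_AMINO_ACIDS
        elif core.startswith("["):
            allowed = len(core) - 2
        elif core.startswith("{"):
            allowed = N_AMINO_ACIDS - (len(core) - 2)
        else:
            allowed = 1
        total += lo * math.log2(N_AMINO_ACIDS / allowed)
    return total


# Illustrative patterns only (not real PROSITE entries).
strict = "C-x(2,4)-C-[LIVM]-{P}-H"
widened = "C-x(2,4)-C-[LIVMFYWC]-{P}-H"
for p in (strict, widened):
    print(p, to_regex(p), round(information_bits(p), 2))
```

Under these assumptions, widening `[LIVM]` to `[LIVMFYWC]` lowers the estimated information content of the pattern, which is the sense in which a generalised pattern is "weaker": it matches more sequences and relies more heavily on the structural filter to retain specificity.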
© 2000 Academic Press

Keywords: PROSITE; protein sequence patterns; structure comparison

*Corresponding author. E-mail address of the corresponding author: wtaylor@nimr.mrc.ac.uk

Abbreviations used: RMSD, root-mean-square deviation; ComPat, combined pattern.

J. Mol. Biol. (2000) 304, 599–619. doi:10.1006/jmbi.2000.4211, available online at http://www.idealibrary.com. 0022-2836/00/040599–21 $35.00/0

Introduction

Proteins can be grouped into families by similarity of structure or function. For each family, the known sequences and structures can be analysed to find common features that help in understanding the biology of the family. These shared features can often be described as a pattern, which can be used to identify new family members. A number of databases of family descriptors exist, most of which give sequence patterns for each family (Hofmann et al., 1999; Bateman et al., 1999; Corpet et al., 1999; Attwood et al., 1999; Wallace et al., 1996).

A sequence pattern can be a regular expression, a weight matrix, a profile, or a hidden Markov model (HMM). The outcome of matching a sequence against a regular expression is discrete (yes/no), and regular expressions are therefore called deterministic patterns. Matching a sequence against a weight matrix, a profile, or an HMM produces a number quantifying the quality of the match; such patterns are called statistical (Brazma et al., 1998). To use statistical patterns to predict the family membership of protein sequences automatically, a threshold needs to be chosen so that sequences obtaining a better score than the threshold are predicted to belong to the family. With a deterministic pattern, a sequence is predicted to belong to a family if it matches the pattern.

In this work we concentrate on deterministic patterns, in particular on the restricted subset of regular expressions used in the PROSITE database of protein family signatures (Hofmann et al., 1999). The aim of the signatures is twofold. Firstly, they are to be used for classification, and thus they should produce few false positives and few false negatives. Secondly, they should describe biologically