Functional Grouping Based on Signatures in Protein Termini

Iris Bahir and Michal Linial*

Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Israel

ABSTRACT The two ends of each protein are known as the amino (N-) and carboxyl (C-) termini. Short signatures in a protein’s termini often carry vital cellular function. No systematic research has been conducted to address the importance of short signatures (3 to 10 amino acids) in protein termini at the proteomic level. Specifically, it is unknown whether such signatures are evolutionarily conserved, and if so, whether this conservation confers shared biological functions. Current signature detection methods fail to detect such short signatures due to inadequate statistical scores. The findings presented in this study strongly support the notion that functional significance of protein sets may be captured by short signatures at their termini. A positional search method was applied to over one million proteins from the UniProt database. The result is a collection of about a thousand significant signature groups (SIGs) that include previously identified as well as many novel signatures in protein termini. These SIGs represent protein sets with minimal or no overall sequence similarity excepting the similarity at their termini. The most significant SIGs are assigned by their strong correspondence to functional annotations derived from external databases such as Gene Ontology. Each of the SIGs is associated with the statistical significance of its functional association. These SIGs provide a valuable source for testing previously overlooked signatures in protein termini and allow for the investigation of the role played by such signatures throughout evolution. The SIGs archive and advanced search options are available at http://www.proteus.cs.huji.ac.il. Proteins 2006;63:996–1004. © 2006 Wiley-Liss, Inc.
Key words: sequence similarity; function classification; bioinformatics; protein signature

INTRODUCTION

Protein sequences are linear polymers of amino acids (aa) of varying length. At the ends of each protein are the amino (N-) and carboxyl (C-) termini. The distinct biochemistry at the N- and C-termini dictates the polarity of the protein and thus contributes to its folding kinetics. The energetic cost of burying a terminus inside the hydrophobic core may be too high in a typical protein. Therefore, most proteins have their terminal regions exposed and accessible for biochemical reactions with the surroundings (e.g., lipids, metal ions) or with partner proteins (e.g., interactions that lead to posttranslational modifications). Moreover, the biochemically distinct nature of each terminus provides positional and chemical uniqueness that is used to execute a large repertoire of biological processes. Over the years, several examples were studied in great detail, emphasizing the unique capacity of protein tails to carry information.¹ Interestingly, most signatures that were documented are related to protein localization, lifetime, and intracellular trafficking. For example, a large number of proteins that participate in ubiquitination-induced degradation² carry a recognition signature at their N-terminus. In addition, most well-established protein sorting signals are located at the N-terminus.³ Organization of protein complexes in the membranes of multicellular organisms is executed by the binding of a protein domain called PDZ to short tails in the C-terminal sequences of a large number of channels, receptors, and signaling molecules.⁴,⁵ To determine the nature of the PDZ binding signature, random peptide libraries were screened and the specificity rules were determined.⁶ Another instance of a short signature that is shared among a large number of proteins throughout evolution is the KDEL signature.
KDEL resides at the C-terminus and specifies an endoplasmic reticulum (ER) retention and/or retrieval signal.⁷ Experimental tests confirmed that the targeting information is entirely contained in this signature. Adding KDEL to the C-terminus (but not to other parts of the protein) conveys the ER retention property to unrelated proteins.⁸ From the examples listed above, it is clear that even a very short linear signature of three to four amino acids has the potential to dominate a vital cell biological process.

Grant sponsor: the Sudarsky Center for Computational Biology in the Hebrew University of Jerusalem (to I.B.).
M. Linial’s present address is the Department of Computer Science & Engineering, University of Washington, Seattle, WA 98195.
*Correspondence to: Michal Linial, Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, 91904 Israel. E-mail: michall@cc.huji.ac.il
Received 15 July 2005; Revised 17 October 2005; Accepted 27 October 2005
Published online 10 February 2006 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/prot.20903
PROTEINS: Structure, Function, and Bioinformatics 63:996–1004 (2006) © 2006 WILEY-LISS, INC.

Protein signatures are detected by a wide variety of methods. Most methods require an initial multiple sequence alignment (MSA) of a selected family of proteins (often referred to as the “seed alignment”) as the basis for constructing Position Specific Scoring Matrix (PSSM) profiles, Hidden Markov Models (HMMs), regular expression based