Functional Grouping Based on Signatures in Protein Termini
Iris Bahir and Michal Linial*
Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Israel
ABSTRACT The two ends of each protein are known as the amino (N-) and carboxyl (C-) termini. Short signatures in a protein’s termini often carry vital cellular functions. No systematic research has been conducted to address the importance of short signatures (3 to 10 amino acids) in protein termini at the proteomic level. Specifically, it is unknown whether such signatures are evolutionarily conserved, and if so, whether this conservation confers shared biological functions. Current signature detection methods fail to detect such short signatures due to inadequate statistical scores. The findings presented in this study strongly support the notion that the functional significance of protein sets may be captured by short signatures at their termini. A positional search method was applied to over one million proteins from the UniProt database. The result is a collection of about a thousand significant signature groups (SIGs) that include previously identified as well as many novel signatures in protein termini. These SIGs represent protein sets with minimal or no overall sequence similarity except for the similarity at their termini. The most significant SIGs are assigned by their strong correspondence to functional annotations derived from external databases such as Gene Ontology. Each of the SIGs is associated with the statistical significance of its functional association. These SIGs provide a valuable source for testing previously overlooked signatures in protein termini and allow for the investigation of the role played by such signatures throughout evolution. The SIGs archive and advanced search options are available at http://www.proteus.cs.huji.ac.il.
Proteins 2006;63:996–1004. © 2006 Wiley-Liss, Inc.
Key words: sequence similarity; function classification; bioinformatics; protein signature
INTRODUCTION
Protein sequences are linear polymers of amino acids (aa) of varying length. At the ends of each protein are the amino (N-) and carboxyl (C-) termini. The distinct biochemistry at the N- and C-termini dictates the polarity of the protein and thus contributes to its folding kinetics. The energetic cost of burying a terminus inside the hydrophobic core may be too high in a typical protein. Therefore, most proteins have their terminal regions exposed and accessible for biochemical reactions with the surroundings (e.g., lipids, metal ions) or with partner proteins (e.g., interactions that lead to posttranslational modifications). Moreover, the biochemically distinct nature of each terminus provides positional and chemical uniqueness that is used to execute a large repertoire of biological processes. Over the years, several examples were studied in great detail, emphasizing the unique capacity of protein tails to carry information.¹ Interestingly, most signatures that were documented are related to protein localization, lifetime, and intracellular trafficking. For example, a large number of proteins that participate in ubiquitination-induced degradation² carry a recognition signature at their N-terminus. In addition, most well-established protein sorting signals are located at the N-terminus.³ Organization of protein complexes in the membranes of multicellular organisms is executed by the binding of a protein domain called PDZ to short tails in the C-terminal sequences of a large number of channels, receptors, and signaling molecules.⁴,⁵ To determine the nature of the PDZ binding signature, random peptide libraries were screened and the specificity rules were determined.⁶ Another instance of a short signature that is shared among a large number of proteins throughout evolution is the KDEL signature. KDEL resides at the C-terminus and specifies an endoplasmic reticulum (ER) retention and/or retrieval signal.⁷ Experimental tests confirmed that the targeting information is entirely included in this signature. Adding KDEL to the C-terminus (but not to other parts of the protein) conveys the ER retention property to unrelated proteins.⁸ From the examples listed above, it is clear that even a very short linear signature of three to four amino acids has the potential to dominate a vital cell biological process.
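To make the positional nature of such signatures concrete, the following minimal Python sketch extracts the last (or first) k residues of each sequence and groups proteins that share an identical terminal string. It is an illustration only and not the search method used in this study: the function names, the exact-match grouping, and the toy sequences are all assumptions introduced here, and no statistical scoring is attempted.

```python
from collections import defaultdict


def terminal_signature(sequence: str, k: int, terminus: str = "C") -> str:
    """Return the k residues at the N- or C-terminus of a protein sequence."""
    if not 3 <= k <= 10:
        raise ValueError("k should lie in the 3 to 10 residue range discussed in the text")
    if len(sequence) < k:
        raise ValueError("sequence is shorter than k")
    if terminus == "C":
        return sequence[-k:]
    if terminus == "N":
        return sequence[:k]
    raise ValueError("terminus must be 'N' or 'C'")


def group_by_terminal_signature(proteins: dict, k: int, terminus: str = "C") -> dict:
    """Map each terminal k-mer to the names of the proteins that carry it."""
    groups = defaultdict(list)
    for name, sequence in proteins.items():
        if len(sequence) >= k:  # skip sequences too short to carry a k-mer
            groups[terminal_signature(sequence, k, terminus)].append(name)
    return dict(groups)


# Toy sequences invented for illustration; they are not real UniProt entries.
toy_proteins = {
    "toy_A": "MKTLLVAGAWSKDEL",
    "toy_B": "MSPQRRNNGTHHKDEL",
    "toy_C": "MAVLKDELGGSTPQR",  # KDEL is internal here, not at the C-terminus
}

groups = group_by_terminal_signature(toy_proteins, k=4, terminus="C")
for signature, names in groups.items():
    print(signature, names)
```

In this toy set, toy_A and toy_B fall into the same group because both end in KDEL, whereas toy_C contains KDEL internally and is therefore grouped separately. This mirrors the observation above that the signal acts only when placed at the C-terminus.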
Protein signatures are detected by a wide variety of methods. Most methods rely on an initial multiple sequence alignment (MSA) of a selected family of proteins (often referred to as the “seed alignment”) as the basis for constructing Position-Specific Scoring Matrix (PSSM) profiles, Hidden Markov Models (HMMs), regular expression based
Grant sponsor: the Sudarsky Center for Computational Biology in
the Hebrew University of Jerusalem (to I.B.).
M. Linial’s present address is the Department of Computer Science
& Engineering, University of Washington, Seattle, WA 98195.
*Correspondence to: Michal Linial, Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, 91904 Israel. E-mail: michall@cc.huji.ac.il
Received 15 July 2005; Revised 17 October 2005; Accepted 27
October 2005
Published online 10 February 2006 in Wiley InterScience
(www.interscience.wiley.com). DOI: 10.1002/prot.20903
PROTEINS: Structure, Function, and Bioinformatics 63:996–1004 (2006)
© 2006 WILEY-LISS, INC.