BATMAS30: Amino Acid Substitution Matrix for Alignment
of Bacterial Transporters
Roman A. Sutormin,1* Aleksandra B. Rakhmaninova,2 and Mikhail S. Gelfand1,2
1 State Scientific Center GosNIIGenetica, Moscow, Russia
2 Integrated Genomics, Moscow, Russia
ABSTRACT Aligned amino acid sequences of three functionally independent samples of transmembrane (TM) transport proteins have been analyzed. The concept of the TM-kernel is proposed as the most probable transmembrane region of a sequence. The average amino acid composition of TM-kernels differs from the published amino acid composition of transmembrane segments. Compared with non-TM-proteins, TM-kernels contain more alanines and glycines and fewer polar, charged, and aromatic residues. There are also differences between TM-kernels of bacterial and eukaryotic proteins. We have constructed amino acid substitution matrices for bacterial TM-kernels, named the BATMAS (BActerial Transmembrane MAtrix of Substitutions) series. In TM-kernels, polar and charged residues, as well as proline and tyrosine, are highly conserved, whereas there are more substitutions within the group of hydrophobic residues, in contrast to non-TM-proteins, which have fewer, relatively more conserved, hydrophobic residues. These results demonstrate that alignment of transmembrane proteins should be based on at least two amino acid substitution matrices, one for loops (e.g., the BLOSUM series) and one for TM-segments (the BATMAS series), and that the choice of the TM-matrix should be different for eukaryotic and bacterial proteins. Proteins 2003;51:85–95. © 2003 Wiley-Liss, Inc.
Key words: comparative analysis; transport proteins; amino acid substitution matrix; evolution; transmembrane segments
INTRODUCTION
The growth of databases describing various characteristics of proteins, such as amino acid sequence, spatial structure, function, and functional domains, allows one to describe new proteins, at least to a first approximation, by comparing the sequences under analysis to already known ones. Most comparative techniques involve alignment of amino acid sequences, which in turn depends on amino acid substitution matrices. Thus, it is crucial to develop adequate substitution matrices for different functional regions of proteins.
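To make the dependence on substitution matrices concrete, the following minimal sketch derives log-odds scores from aligned columns in the general manner of BLOSUM-type matrices, here without sequence clustering or pseudocounts. It is an illustration under stated assumptions, not the procedure used to build the BATMAS series; the function name `log_odds_matrix` and the `scale` parameter are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_matrix(columns, scale=2.0):
    """Return {(a, b): score} for residue pairs observed in aligned columns.

    columns: iterable of strings, one string of residues per alignment column.
    Keys are alphabetically sorted pairs; unobserved pairs get no entry,
    because their log-odds would be undefined without pseudocounts.
    """
    pair_counts = Counter()
    for col in columns:
        for a, b in combinations(col, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total_pairs = sum(pair_counts.values())
    if total_pairs == 0:
        raise ValueError("no residue pairs found in the alignment columns")

    # Observed frequency q_ab of each unordered pair.
    q = {pair: n / total_pairs for pair, n in pair_counts.items()}

    # Marginal residue frequency: p_a = q_aa + sum over b != a of q_ab / 2.
    p = Counter()
    for (a, b), f in q.items():
        if a == b:
            p[a] += f
        else:
            p[a] += f / 2
            p[b] += f / 2

    scores = {}
    for (a, b), f in q.items():
        # Expected frequency if residues paired independently.
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = scale * math.log2(f / expected)
    return scores
```

A positive score marks a pair that occurs more often than expected by chance in the sample, so matrices built from different samples (for example, membrane segments versus loops) can assign different scores to the same pair.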
The best known and most commonly used substitution matrices are the BLOSUM and PAM series, obtained by statistical analysis of large samples of amino acid sequences.1,2 It is becoming increasingly clear that aligning proteins with non-standard physical and chemical characteristics and amino acid composition requires specific matrices. Among such proteins is the group of transmembrane (TM) hydrophobic proteins. The idea that transmembrane proteins should be aligned using two different matrices at the same time, one for hydrophobic membrane segments and the other for hydrophilic loops, has been discussed repeatedly. TM-specific scoring matrices derived using PHDhtm, an algorithm that predicts TM-segments in multiple alignments by neural networks, were published by Ng et al.3 and Muller et al.4 A substitution matrix for highly homologous TM-proteins based on SwissProt annotations was constructed,5 and then the Dayhoff mutation model was applied to derive matrices for comparison of more distant proteins. In all these studies, bacterial and eukaryotic proteins were combined into a single sample. As will be shown below, the statistical properties of bacterial and eukaryotic TM-segments differ, and thus transmembrane proteins of eubacterial and eukaryotic origin should be considered separately.
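The two-matrix idea can be sketched as follows: each column of a pairwise alignment is scored with a TM matrix or a loop matrix, depending on whether the column is labeled as lying in the membrane. This is a minimal sketch under stated assumptions. The function names, the flat gap penalty, and the numbers in `TOY_TM` and `TOY_LOOP` are hypothetical placeholders and are not entries of any published BATMAS or BLOSUM matrix.

```python
def pair_score(matrix, a, b):
    """Look up a symmetric score stored under either key order."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def score_alignment(seq_a, seq_b, in_membrane, tm_matrix, loop_matrix, gap=-4):
    """Sum column scores, choosing the matrix by each column's region label.

    seq_a, seq_b: aligned sequences of equal length ('-' marks a gap).
    in_membrane: one boolean per column, True for predicted TM columns.
    gap: flat per-column gap penalty (an assumption of this sketch).
    """
    if not (len(seq_a) == len(seq_b) == len(in_membrane)):
        raise ValueError("sequences and region labels must have equal length")
    total = 0
    for a, b, tm in zip(seq_a, seq_b, in_membrane):
        if a == "-" or b == "-":
            total += gap
            continue
        matrix = tm_matrix if tm else loop_matrix
        total += pair_score(matrix, a, b)
    return total


# Placeholder scores for illustration only (NOT real matrix values).
TOY_TM = {("L", "L"): 3, ("V", "V"): 3, ("L", "V"): 2, ("K", "K"): 8}
TOY_LOOP = {("L", "L"): 4, ("V", "V"): 4, ("L", "V"): 1, ("K", "K"): 5}

example = score_alignment("LKV", "VKL", [True, False, True], TOY_TM, TOY_LOOP)
```

In a full aligner the same region-dependent lookup would sit inside the dynamic-programming recurrence; here it is applied to an already aligned pair to keep the example short.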
The main problem arising during construction of substitution or score matrices for transmembrane proteins is that in most cases it is not known which part of a protein actually resides within the membrane. The reason is that transmembrane proteins crystallize poorly, and thus only a few such proteins have spatial structures determined by X-ray analysis.6,7 Different methods for prediction of transmembrane segments yield contradictory results when applied to the same sequence; for a typical example see Figure 1.
Grant sponsor: Howard Hughes Medical Institute; Grant number: 55000309; Grant sponsor: INTAS; Grant number: 99-1476; Grant sponsor: Ludwig Institute for Cancer Research; Grant number: CRDF RB0-1268; Grant sponsor: Russian Fund of Basic Research; Grant number: 00-15-99362.
*Correspondence to: Roman A. Sutormin, State Scientific Center GosNIIGenetica, Moscow, 113545, Russia. E-mail: sutor_ra@mail.ru
Received 15 March 2002; Accepted 12 September 2002
PROTEINS: Structure, Function, and Genetics 51:85–95 (2003)
© 2003 WILEY-LISS, INC.
At the same time, the large number of known transmembrane proteins allows one to apply comparative analysis for verification of predicted TM-segments using various criteria of consistency. A somewhat similar approach was used to predict TM-segments by consensus methods.8,9 We use two criteria: agreement between five different TM-