BATMAS30: Amino Acid Substitution Matrix for Alignment of Bacterial Transporters

Roman A. Sutormin,1* Aleksandra B. Rakhmaninova,2 and Mikhail S. Gelfand1,2

1 State Scientific Center GosNIIGenetica, Moscow, Russia
2 Integrated Genomics, Moscow, Russia

ABSTRACT Aligned amino acid sequences of three functionally independent samples of transmembrane (TM) transport proteins have been analyzed. The concept of a TM-kernel is proposed as the most probable transmembrane region of a sequence. The average amino acid composition of TM-kernels differs from the published amino acid composition of transmembrane segments. TM-kernels contain more alanines and glycines, and fewer polar, charged, and aromatic residues, than non-TM proteins. There are also differences between the TM-kernels of bacterial and eukaryotic proteins. We have constructed amino acid substitution matrices for bacterial TM-kernels, named the BATMAS (BActerial Transmembrane MAtrix of Substitutions) series. In TM-kernels, polar and charged residues, as well as proline and tyrosine, are highly conserved, whereas there are more substitutions within the group of hydrophobic residues, in contrast to non-TM proteins, which have fewer, relatively more conserved, hydrophobic residues. These results demonstrate that alignment of transmembrane proteins should be based on at least two amino acid substitution matrices, one for loops (e.g., the BLOSUM series) and one for TM-segments (the BATMAS series), and that the choice of the TM-matrix should be different for eukaryotic and bacterial proteins. Proteins 2003;51:85–95. © 2003 Wiley-Liss, Inc.
Key words: comparative analysis; transport proteins; amino acid substitution matrix; evolution; transmembrane segments

INTRODUCTION

The growth of databases describing various characteristics of proteins, such as amino acid sequence, spatial structure, function, and functional domains, allows one to describe new proteins, at least to a first approximation, by comparing the sequences under analysis to already known ones. Most comparative techniques involve alignment of amino acid sequences, which, in turn, depends on amino acid substitution matrices. Thus, it is crucial to develop adequate substitution matrices for different functional regions of proteins.

The best known and most commonly used substitution matrices are the BLOSUM and PAM series, obtained by statistical analysis of large samples of amino acid sequences [1,2]. It is becoming increasingly clear that specific matrices are required to align proteins with non-standard physical and chemical characteristics and amino acid composition. Among such proteins is the group of transmembrane (TM) hydrophobic proteins.

The idea that transmembrane proteins should be aligned using two different matrices at the same time, one for hydrophobic membrane segments and the other for hydrophilic loops, has been discussed repeatedly. TM-specific scoring matrices derived using PHDhtm, an algorithm that predicts TM-segments in multiple alignments by neural networks, were published by Ng et al. [3] and Muller et al. [4]. A substitution matrix for highly homologous TM-proteins was constructed based on SwissProt annotations [5], and the Dayhoff mutation model was then applied to derive matrices for comparison of more distant proteins. In all these studies, bacterial and eukaryotic proteins were combined into a single sample. As will be shown below, the statistical properties of bacterial and eukaryotic TM-segments differ, and thus transmembrane proteins of eubacterial and eukaryotic origin should be considered separately.
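To make concrete what "statistical analysis of large samples" means for a matrix such as BLOSUM, the following is a minimal sketch of a BLOSUM-style log-odds estimate from gap-free alignment columns. It is an illustration under stated assumptions, not the procedure used to build BATMAS: the function name `log_odds_matrix`, the `scale` parameter, and the toy columns are hypothetical, and sequence clustering and pseudocounts are omitted.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_matrix(columns, scale=2.0):
    """Estimate log-odds substitution scores from gap-free alignment columns.

    columns: iterable of strings, one string per alignment column.
    Returns a dict mapping sorted residue pairs (a, b) to integer scores.
    Pairs that are never observed are omitted; a real pipeline would
    need pseudocounts or a much larger sample (assumption of this sketch).
    """
    # Count every unordered residue pair within each column.
    pair_counts = Counter()
    for col in columns:
        for a, b in combinations(col, 2):
            pair_counts[tuple(sorted((a, b)))] += 1

    total = sum(pair_counts.values())
    if total == 0:
        raise ValueError("need at least one column with two or more residues")

    # Observed pair frequencies q_ab and marginal residue frequencies p_a.
    q = {pair: n / total for pair, n in pair_counts.items()}
    p = Counter()
    for (a, b), freq in q.items():
        p[a] += freq / 2
        p[b] += freq / 2

    # Score = scale * log2(observed / expected), rounded to an integer.
    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(scale * math.log2(freq / expected))
    return scores


# Toy input (hypothetical): two fully conserved columns.
print(log_odds_matrix(["AA", "LL"]))
```

In this toy input both columns are fully conserved, so each identical pair is observed twice as often as expected by chance and scores 2 with the default scale. Estimating one such matrix from TM regions and another from loops would give the kind of region-specific pair of matrices discussed above.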
The main problem arising during construction of substitution or score matrices for transmembrane proteins is that in most cases it is not known what part of a protein actually resides within the membrane. The reason is that transmembrane proteins crystallize poorly, and thus only a few such proteins have known spatial structures determined by X-ray analysis [6,7]. Different methods for prediction of transmembrane segments yield contradictory results when applied to the same sequence; for a typical example see Figure 1.

Grant sponsor: Howard Hughes Medical Institute; Grant number: 55000309; Grant sponsor: INTAS; Grant number: 99-1476; Grant sponsor: Ludwig Institute for Cancer Research; Grant number: CRDF RB0-1268; Grant sponsor: Russian Fund of Basic Research; Grant number: 00-15-99362.
*Correspondence to: Roman A. Sutormin, State Scientific Center GosNIIGenetica, Moscow, 113545, Russia. E-mail: sutor_ra@mail.ru
Received 15 March 2002; Accepted 12 September 2002
PROTEINS: Structure, Function, and Genetics 51:85–95 (2003) © 2003 WILEY-LISS, INC.

At the same time, the large number of known transmembrane proteins allows one to apply comparative analysis to verify predicted TM-segments using various criteria of consistency. A somewhat similar approach was used to predict TM-segments by consensus methods [8,9]. We use two criteria: agreement between five different TM-