Using Motif-Based Methods in Multiple Genome Analyses: A Case Study Comparing Orthologous Mesophilic and Thermophilic Proteins†

David La, Melanie Silver, Robert C. Edgar,§ and Dennis R. Livesay*,‡

Department of Chemistry, California State Polytechnic University at Pomona, 3801 West Temple Avenue, Pomona, California 91768, and 195 Roque Moraes Drive, Mill Valley, California 94941

Received December 31, 2002; Revised Manuscript Received May 23, 2003

ABSTRACT: Protein motifs represent highly conserved regions within protein families and are generally accepted to describe critical regions required for protein stability and/or function. In this comprehensive analysis, we present a robust, unique approach to identify and compare corresponding mesophilic and thermophilic sequence motifs between all orthologous proteins within 44 microbial genomes. Motif similarity is determined through global sequence alignment of mesophilic and thermophilic motif pairs, which are identified by a greedy algorithm. Our results reveal only modest correlation between motif and overall sequence similarity, highlighting the rationale of motif-based approaches in comprehensive multigenome comparisons. Conserved mutations reflect previously suggested physiochemical principles for conferring thermostability. Additionally, comparisons between corresponding mesophilic and thermophilic motif pairs provide key biochemical insights related to thermostability and can be used to test the evolutionary robustness of individual structural comparisons. We demonstrate the ability of our unique approach to provide key insights in two examples: the TATA-box binding protein and glutamate dehydrogenase families. In the latter example, conserved mutations hint at novel origins leading to structural stability differences within the hexamer structures. Additionally, we present amino acid composition data and average protein length comparisons for all 44 microbial genomes.
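To make the pairing idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It scores every mesophilic/thermophilic motif pair by global (Needleman-Wunsch) alignment and then pairs motifs greedily, best score first. The scoring scheme (match +1, mismatch -1, linear gap -2), the tie-breaking rule, the function names `nw_score` and `greedy_pairs`, and the example sequences are all assumptions introduced for illustration; this section does not state the substitution matrix, gap penalties, or greedy criterion actually used.

```python
from typing import List, Tuple


def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Global (Needleman-Wunsch) alignment score of two sequences.

    Assumed scoring: identity-based match/mismatch and a linear gap penalty.
    """
    # prev[j] holds the best score for aligning a[:i-1] against b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def greedy_pairs(meso: List[str], thermo: List[str]) -> List[Tuple[int, int, int]]:
    """Greedily pair motifs: repeatedly take the best-scoring unused pair.

    Returns (mesophilic index, thermophilic index, score) tuples.
    Ties are broken by lower indices (an arbitrary, assumed rule).
    """
    scored = sorted(
        ((nw_score(m, t), i, j) for i, m in enumerate(meso) for j, t in enumerate(thermo)),
        key=lambda x: (-x[0], x[1], x[2]),
    )
    used_m, used_t, pairs = set(), set(), []
    for score, i, j in scored:
        if i not in used_m and j not in used_t:
            pairs.append((i, j, score))
            used_m.add(i)
            used_t.add(j)
    return pairs


if __name__ == "__main__":
    # Invented toy motifs, for illustration only
    mesophilic = ["GDSGGP", "HWKDAE"]
    thermophilic = ["HWRDAE", "GDSGGP"]
    for i, j, s in greedy_pairs(mesophilic, thermophilic):
        print(mesophilic[i], thermophilic[j], s)
```

A greedy assignment of this kind is simple and fast but is not guaranteed to maximize the total score over all pairs; an optimal assignment method would be needed for that.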
Proteins that function under standard (mesophilic) conditions tend to have similar structural stabilities, despite having different sequences and structural folds (1, 2). Several organisms, mostly archaea, thrive under extreme environmental conditions, e.g., high pressure, high salt concentrations, very high and low temperatures, and extreme pH. Enzymes that function optimally in such adverse conditions mediate the metabolic and biological functions of these organisms. Proteins from thermophilic organisms (those that live at extremely high ambient temperatures) generally exhibit substantially higher intrinsic thermal stabilities than their mesophilic counterparts while retaining the basic fold characteristics of the whole family (3). Although the molecular underpinnings of protein thermal stabilization have been the focus of many experimental and theoretical research efforts (for a review see Vielle et al.), the subject is only partially understood (3, 4). In general, it is thought that thermostability is achieved by an increase in the types and numbers of noncovalent interactions (5). Analyses of all noncovalent interactions within thermophilic and mesophilic structural pairs reveal that thermophilic proteins generally have increased numbers of van der Waals interactions, hydrogen bonds, salt bridges, dipole-dipole interactions, disulfide bridges, and hydrophobic interactions (5-18). Other differences include shortening of loop regions, fewer and smaller destabilizing voids within the protein, increased structural water content, and increased incidence of ion binding (16, 19-21). Increased conformational rigidity of the protein structure and optimization of the surface electrostatics also appear to parallel thermostability (22-28). The secondary structure propensity of each amino acid within α-helices and β-sheets has also been demonstrated to be linked to added stability (29, 30).
Despite key differences between mesophilic and thermophilic structural pairs, the overall fold and the active site of the protein generally remain unchanged (31). To overcome the lack of abundant structural data for orthologous mesophilic and thermophilic protein pairs, Chakaravarty et al. have created high-quality homology models taken from 30 complete bacterial genomes (nine of which are thermophilic) (32). That study identifies several statistically significant, specific amino acid substitutions, significantly more salt bridges in thermophiles, a slight decrease in loop length, and an increase in previously overlooked cation-π interactions. Additionally, statistically significant hydrophobic amino acid substitutions are reported to be consistent with decreased side chain conformational entropy.

† This work was supported by an American Chemical Society Petroleum Research Fund type G grant (36848-GB4), an NIH score grant (S06 GM53933), and a supercomputer allocation from the National Center for Supercomputing Applications to D.R.L.
* Corresponding author. Telephone: 909-869-4409. E-mail: drlivesay@csupomona.edu.
‡ California State Polytechnic University at Pomona.
§ Roque Moraes Drive, Mill Valley, CA.

Biochemistry 2003, 42, 8988-8998. 10.1021/bi027435e. © 2003 American Chemical Society. Published on Web 07/09/2003.

Several studies have concentrated on sequence analysis to investigate the origins of thermostability. Much of this work has focused on differences in amino acid composition between mesophilic and thermophilic genomes. It has been observed that arginine and tyrosine are significantly more