Amino Acid Sequence Autocorrelation Vectors and Bayesian-Regularized Genetic Neural Networks for Modeling Protein Conformational Stability: Gene V Protein Mutants

Leyden Fernández,1 Julio Caballero,1 José Ignacio Abreu,1,2 and Michael Fernández1*

1 Molecular Modeling Group, Center for Biotechnological Studies, Faculty of Agronomy, University of Matanzas, 44740 Matanzas, Cuba
2 Artificial Intelligence Lab, Faculty of Informatics, University of Matanzas, 44740 Matanzas, Cuba

ABSTRACT Development of novel computational approaches for modeling protein properties from their primary structure is the main goal in applied proteomics. In this work, we report the extension of the autocorrelation vector formalism to amino acid sequences for encoding protein structural information for modeling purposes. Amino acid sequence autocorrelation (AASA) vectors were calculated by measuring the autocorrelations, at sequence lags ranging from 1 to 15 on the protein primary structure, of 48 amino acid/residue properties selected from the AAindex database. A total of 720 AASA descriptors were tested for building predictive models of the change in thermal unfolding Gibbs free energy (ΔΔG) of gene V protein upon mutation. To this end, ensembles of Bayesian-regularized genetic neural networks (BRGNNs) were used to obtain an optimum nonlinear model for the conformational stability. The ensemble predictor explained about 88% and 66% of the data variance in the training and test sets, respectively. Furthermore, the optimum AASA vector subset not only helped to successfully model unfolding stability but also distributed wild-type and mutant gene V proteins well on a stability self-organized map (SOM) when used for unsupervised training of competitive neurons. Proteins 2007;67:834–852. © 2007 Wiley-Liss, Inc.
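As a rough illustration of the descriptor calculation summarized in the abstract, the sketch below computes lagged autocorrelations of a per-residue property along a sequence. It assumes a Moreau-Broto-style definition (the mean product of property values for residues separated by a given lag); the exact normalization used in this work is not stated in the abstract. The function names and the toy property scale are hypothetical and are not taken from AAindex.

```python
from typing import Dict, List, Sequence


def autocorrelation_vector(values: Sequence[float], max_lag: int = 15) -> List[float]:
    """Mean product of property values at positions i and i + lag, for lag = 1..max_lag.

    Assumption: Moreau-Broto-style autocorrelation normalized by the number
    of residue pairs at each lag. Lags longer than the sequence yield 0.0.
    """
    n = len(values)
    vector = []
    for lag in range(1, max_lag + 1):
        pairs = n - lag
        if pairs <= 0:
            vector.append(0.0)
            continue
        total = sum(values[i] * values[i + lag] for i in range(pairs))
        vector.append(total / pairs)
    return vector


def aasa_descriptors(
    sequence: str,
    scales: Dict[str, Dict[str, float]],
    max_lag: int = 15,
) -> Dict[str, float]:
    """Build one descriptor per (property scale, lag) pair for a sequence."""
    descriptors = {}
    for name, scale in scales.items():
        # Map each residue to its numeric property value on this scale.
        values = [scale[residue] for residue in sequence]
        lagged = autocorrelation_vector(values, max_lag)
        for lag, value in enumerate(lagged, start=1):
            descriptors[f"{name}_lag{lag}"] = value
    return descriptors


if __name__ == "__main__":
    # Toy scale with made-up values, for illustration only.
    toy_scale = {"A": 1.0, "G": 0.0, "V": 2.0}
    print(aasa_descriptors("AGVAG", {"toy": toy_scale}, max_lag=2))
```

Under these assumptions, 48 property scales evaluated at 15 lags give 48 × 15 = 720 values per protein, which matches the descriptor count quoted in the abstract.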
Key words: protein stability prediction; point mutations; Bayesian regularization; artificial neural networks; genetic algorithm

INTRODUCTION

Evidence is accumulating that many disease-causing mutations exert their effects by altering protein folding. Predicting protein structures and stability is a fundamental goal in molecular biology. Even the prediction of changes in structure and stability induced by point mutations has immediate application in computational protein design. 1–4 Although free energy simulations have accurately predicted the relative stabilities of point mutants, 5 the computational cost of most of these methods is too high for testing the large number of mutations studied in protein design applications.

Current efforts to translate structural data into energetic parameters focus on developing fast algorithms for protein energy calculations. However, the development of fast and reliable protein force-fields is a complex task because of the delicate balance between the different energy terms that contribute to protein stability. Force-fields for predicting protein stability can be divided into three main groups: physical effective energy functions (PEEF), statistical potential-based effective energy functions (SEEF), 6 and empirical data-based energy functions (EEEF).

Within the PEEF approach, a simplified energy function with only van der Waals and side chain torsion potentials 7 has been used to predict the stabilities of the λ repressor protein for mutations involving only hydrophobic residues. In addition, an improved optimization method including continuously flexible side chain angles demonstrated better prediction accuracy than discrete side chain angles taken from a rotamer library. 8 In turn, the SEEF method includes statistical potentials derived from geometric and environmental propensities and correlations of residues in X-ray crystal structures.
Potentials derived from substitution and occurrence frequencies for amino acids in different structural environment classes, such as main chain conformations and solvent accessibilities, have also been used to calculate the stability differences induced by point mutations. 6,9,10

*Correspondence to: Michael Fernández, Molecular Modeling Group, Center for Biotechnological Studies, Faculty of Agronomy, University of Matanzas, 44740 Matanzas, Cuba. E-mail: michael.fernandez@umcc.cu
Received 17 March 2006; Revised 28 September 2006; Accepted 8 November 2006
Published online 21 March 2007 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/prot.21349
PROTEINS: Structure, Function, and Bioinformatics 67:834–852 (2007)

On the other hand, the EEEF approach combines a physical description of the interactions with data obtained from experiments previously run on proteins. Examples of such algorithms are the helix/coil transition algorithm AGADIR 11,12 and FOLDEF, a fast and accurate EEEF approach based on the AGADIR algorithm that uses a