Progress in Predicting Inter-Residue Contacts of Proteins With Neural Networks and Correlated Mutations

Piero Fariselli,1 Osvaldo Olmea,2 Alfonso Valencia,2 and Rita Casadio1*
1 CIRB and Department of Biology, University of Bologna, Bologna, Italy
2 Protein Design Group, CNB-CSIC Cantoblanco, Madrid, Spain

ABSTRACT This article presents recent progress in predicting inter-residue contacts of proteins with a neural network-based method. Improvement over the results obtained at the previous CASP3 competition is attained by using as input to the network a complex code, which includes evolutionary information, sequence conservation, correlated mutations, and predicted secondary structures. The predictor was trained and cross-validated on a data set comprising the contact maps of 173 non-homologous proteins as computed from their well-resolved three-dimensional structures. The method could assign protein contacts with an average accuracy of 0.21 and with an improvement over a random predictor of a factor greater than 6, which is higher than that previously obtained with methods based only on either neural networks or correlated mutations. Although far from being ideal, these scores are the highest reported so far for predicting protein contact maps. On 29 targets automatically predicted by the server (CORNET), the average accuracy is 0.14. The predictor performs poorly on all-α proteins, which are not represented in the training set. On all-β and mixed proteins (22 targets), the average accuracy is 0.16. This set comprises proteins of different complexity and different chain length, suggesting that the predictor is capable of generalization over a broad range of features. Proteins 2001;Suppl 5:157–162. © 2002 Wiley-Liss, Inc.

Key words: protein structure predictions; contact maps; correlated mutations; neural networks; residue contacts

INTRODUCTION

A useful two-dimensional representation of a protein's three-dimensional (3D) structure is its contact map.
1 Secondary structures are easily detected from the contact map. α-Helices appear as thick bands along the main diagonal, involving contacts between residues at positions i and i+4. Bands offset from the main diagonal and running parallel or perpendicular to it are the distinctive marks of parallel or antiparallel β-sheets, respectively. The remaining contacts in the representation are sparse and/or clustered in segregated areas, depending on the structural complexity of the protein. In real proteins, the number of contacts scales linearly with chain length.2–4 The slope of this linear relation depends on the contact definition.3

Contacts have been defined in various ways. Routinely, a contact is said to exist between a pair of residues whenever their mutual distance is below a given arbitrary threshold. The distance used in the different definitions of a contact can be that between the Cα–Cα atoms,3 that between the Cβ–Cβ atoms,2,5,6 or the minimal distance between atoms belonging to the side chain or to the backbone of the two residues.4

If the true physical contact map of a protein is known, it is possible to recover its 3D structure. The similarity to the native structure is still rather good [low root-mean-square deviation (RMSD) to the crystal] even when the number of true contacts is reduced by a factor of two.3 A relevant issue is, therefore, whether it is possible to predict the contact map of a protein starting from the residue sequence and, most importantly, to what extent the prediction can be useful to reconstruct the protein structure. In this article we focus on the accuracy of the prediction of contact maps that can be obtained with our predictor (CORNET) and highlight some future perspectives for this ab initio procedure.

MATERIALS AND METHODS

We developed CORNET, a predictor that is essentially based on neural networks.
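Because the predictor is trained against contact maps, it may help to make the thresholded-distance definition from the Introduction concrete. The following is a minimal sketch, not CORNET code: the function names `contact_map` and `count_contacts`, the 8.0 cutoff, the `min_separation` parameter, and the toy coordinates are all illustrative assumptions, and the choice of representative atom per residue is left to the caller.

```python
import math


def contact_map(coords, threshold=8.0, min_separation=1):
    """Return a symmetric boolean contact map from one 3D point per residue.

    coords: list of (x, y, z) tuples, one representative atom per residue.
    threshold: distance cutoff in the same units as coords
               (8.0 is an illustrative value, not taken from the article).
    min_separation: minimum sequence separation |i - j| for a pair
                    to be considered (hypothetical parameter).
    """
    n = len(coords)
    cmap = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            # A contact exists when the pair distance is below the cutoff.
            if math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = cmap[j][i] = True
    return cmap


def count_contacts(cmap):
    """Count contacts in the upper triangle of the map (each pair once)."""
    n = len(cmap)
    return sum(cmap[i][j] for i in range(n) for j in range(i + 1, n))


# Toy chain: five points spaced 3.8 units apart along a straight line.
toy = [(3.8 * k, 0.0, 0.0) for k in range(5)]
toy_map = contact_map(toy, threshold=8.0)
```

On this toy chain, residues one and two positions apart fall under the cutoff, so the map holds seven contacts. Changing the cutoff or the representative atom changes the count, which is one way to see why the slope of the contacts-versus-length relation depends on the contact definition.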
The system was trained to learn the association rules between the covalent structure of each protein belonging to a selected database and its contact map. What is new in the present version is the input coding, which is considerably more complex than those previously used for the same task. CORNET was specifically designed to include evolutionary information in the form of sequence profile, sequence conservation, correlated mutations, and predicted secondary structures. We were prompted to modify the input coding by the results obtained at CASP3.7 The method is outlined briefly below.

Grant sponsor: Ministero della Università e della Ricerca Scientifica e Tecnologica; Grant sponsor: Italian Centro Nazionale delle Ricerche (Target: Biotechnology).
O. Olmea’s present address is Department of Physiology and Biophysics, Mount Sinai School of Medicine, New York, NY.
*Correspondence to: Rita Casadio, Department of Biology, Via Irnerio 42, I-40126, Bologna, Italy. E-mail: casadio@alma.unibo.it
Received 27 March 2001; Accepted 2 July 2001
Published online 28 January 2002
PROTEINS: Structure, Function, and Genetics Suppl 5:157–162 (2001) DOI 10.1002/prot.1173 © 2002 WILEY-LISS, INC.
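Of the input features listed in the Methods paragraph above, the sequence profile is the simplest to illustrate. The sketch below computes per-column residue frequencies from a small alignment. It is only an assumption about what such a profile might look like: this section does not specify CORNET's actual encoding, and the function name `sequence_profile`, the gap handling, and the toy alignment are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_profile(alignment):
    """Per-column residue frequencies for equal-length aligned sequences.

    Gaps and any non-standard symbols are ignored when counting, so each
    column's frequencies are normalized over the standard residues seen.
    """
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    profile = []
    for col in range(length):
        counts = Counter(
            seq[col] for seq in alignment if seq[col] in AMINO_ACIDS
        )
        total = sum(counts.values())
        # One 20-entry frequency dictionary per alignment column.
        profile.append(
            {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}
        )
    return profile


# Toy alignment (hypothetical data): three sequences, four columns.
toy_alignment = ["ACDA", "ACEA", "GC-A"]
toy_profile = sequence_profile(toy_alignment)
```

In the toy alignment the second column is fully conserved (frequency 1.0 for C), while the third column splits evenly between D and E once the gap is ignored. A fully conserved column of this kind is also what a sequence-conservation measure would score highest.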