A Pair-to-Pair Amino Acids Substitution Matrix and its Applications for Protein Structure Prediction

Eran Eyal,1†* Milana Frenkel-Morgenstern,2† Vladimir Sobolev,1 and Shmuel Pietrokovski2

1 Department of Plant Sciences, Weizmann Institute of Science, Rehovot 76100, Israel
2 Department of Molecular Genetics, Weizmann Institute of Science, Rehovot 76100, Israel

ABSTRACT

We present a new structurally derived pair-to-pair substitution matrix (P2PMAT). The matrix is constructed from a very large number of integrated high-quality multiple sequence alignments (Blocks) and protein structures. It evaluates the likelihoods of all 160,000 pair-to-pair substitutions. The P2PMAT matrix implicitly accounts for evolutionary conservation, correlated mutations, and residue–residue contact potentials. The usefulness of the matrix for structural predictions is shown in this article. Prediction of protein residue–residue contacts from sequence information alone by our method (P2PConPred) is particularly accurate in protein cores, where it performs better than other basic contact prediction methods (increasing accuracy by 25–60%). The method's mean accuracy for protein cores is 24% for 59 diverse families and 34% for a subset of proteins shorter than 100 residues. This is above the level that was recently shown to be sufficient to significantly improve ab initio protein structure prediction. We also demonstrate the ability of our approach to identify native structures within large sets (300–2000) of protein decoys. On the basis of evolutionary information alone, our method ranks the native structure in the top 0.3% of the decoys in 4 of 10 sets, and in the top 10% of the decoys in 8 of 10 sets. The method can thus be used to help filter out wrong models, complementing traditional scoring functions. Proteins 2007;67:142–153. © 2007 Wiley-Liss, Inc.
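The figure of 160,000 follows from simple counting: 20 amino acids give 20 × 20 = 400 ordered residue pairs, so a matrix of substitutions from one pair to another has 400 × 400 = 160,000 entries. The sketch below shows one possible way to index such a matrix. The residue ordering and the helper names `pair_index` and `p2p_index` are illustrative assumptions, not the layout of the distributed P2PMAT file.

```python
# Illustrative indexing of a pair-to-pair substitution matrix.
# The alphabet order and helper names are assumptions made for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
N = len(AMINO_ACIDS)


def pair_index(a, b):
    """Map an ordered residue pair (a, b) to an integer in [0, 400)."""
    return AA_INDEX[a] * N + AA_INDEX[b]


def p2p_index(pair_from, pair_to):
    """Row and column of the substitution pair_from -> pair_to in a 400 x 400 matrix."""
    return pair_index(*pair_from), pair_index(*pair_to)


n_pairs = N * N                       # 400 ordered residue pairs
n_substitutions = n_pairs * n_pairs   # 160,000 pair-to-pair substitutions
```

For example, `p2p_index("AA", "YY")` addresses the entry for a pair of alanines replaced by a pair of tyrosines.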
Key words: contact prediction; correlated mutations; ab initio

INTRODUCTION

Predicting the folding pattern of proteins from their sequence is a key and intensively investigated problem in computational molecular biology. The strategy for the solution is chosen according to the structural knowledge available regarding the protein of interest and its family members. The most challenging situation is when the protein sequence cannot be related to known structures (ab initio structure prediction). Although considerable advances have been made in recent years,1–3 the accuracy of ab initio methods remains far behind that of methods that rely on known structures.4 To facilitate the development of scoring functions for modeling structures, several decoy sets of protein structures were built and recently organized. These sets allow the evaluation of scoring functions for structural modeling, irrespective of the conformational sampling problem. The ultimate goal of any scoring function is to rank the native structure of the protein as more stable than any decoy structure.

Prediction of residue–residue contacts within proteins is a problem closely related to structural model evaluation. The two problems are intimately connected, as confident knowledge of only a few contacts is sufficient to predict the overall fold of a protein,1,5,6 and has been shown to enable protein design.7

Several attempts have also been made over the last decade to develop computational methods for predicting contacts based on sequence information. Most of these methods use correlated mutation analysis to search for pairs of sites with covarying amino acids. Contacting sites typically show a signal above the background noise when enough homologous sequences are known and correctly aligned.8–11 Although correlations also exist between noncontacting sites,12 correlated mutation analysis is a useful tool for the identification of protein residue–residue contacts,9,10 fold recognition,1,5,13 and protein–protein contact prediction.14 Another way to improve contact prediction using sequence alignment data is to average the predictions for individual homologous proteins according to their alignment with one another.15

Some methods use correlation coefficients calculated directly from multiple sequence alignments using substitution matrices.9 These include simple binary identity matrices,16 knowledge-based matrices such as McLachlan's11,17 or BLOSUM matrices,18,19 a biophysical complementarity matrix,9 and other contact potential matrices.20 Using multiple sequence alignments for structure prediction assumes that most proteins in the multiple alignment have the same overall structural fold and that in any two aligned columns most pairs of residues are either contacting or noncontacting.

The P2PMAT matrix and the contact prediction program are freely available at http://ignmtest.ccbb.pitt.edu/p2pdocs/dist.
† E. Eyal and M. Frenkel-Morgenstern contributed equally to this work.
*Correspondence to: Eran Eyal, Department of Computational Biology, The University of Pittsburgh, BST3 BDG, 3501 Fifth Avenue, Pittsburgh, PA 15260. E-mail: eyal@ccbb.pitt.edu
Received 6 March 2006; Accepted 16 August 2006
Published online 22 January 2007 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/prot.21223
© 2007 WILEY-LISS, INC. PROTEINS: Structure, Function, and Bioinformatics 67:142–153 (2007)
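The correlated mutation analysis described in the Introduction can be illustrated with a minimal sketch. For each alignment column, it scores every pair of sequences with a substitution matrix, then takes the Pearson correlation between the score vectors of two columns. This is a simplified illustration under stated assumptions: it uses the binary identity matrix, uniform sequence weights, and a hypothetical four-sequence toy alignment. It is not the P2PConPred method, and the function names are invented for this example.

```python
import itertools
import math


def identity_score(a, b):
    """Binary identity matrix: 1 for identical residues, 0 otherwise."""
    return 1.0 if a == b else 0.0


def column_similarities(column, score):
    """Substitution score for every pair of sequences at one alignment column."""
    return [score(a, b) for a, b in itertools.combinations(column, 2)]


def correlated_mutation_score(col_i, col_j, score=identity_score):
    """Pearson correlation between the pairwise-score vectors of two columns."""
    s_i = column_similarities(col_i, score)
    s_j = column_similarities(col_j, score)
    mean_i = sum(s_i) / len(s_i)
    mean_j = sum(s_j) / len(s_j)
    cov = sum((x - mean_i) * (y - mean_j) for x, y in zip(s_i, s_j))
    var_i = sum((x - mean_i) ** 2 for x in s_i)
    var_j = sum((y - mean_j) ** 2 for y in s_j)
    if var_i == 0 or var_j == 0:
        return 0.0  # a fully conserved column carries no covariation signal
    return cov / math.sqrt(var_i * var_j)


def rank_column_pairs(alignment, score=identity_score):
    """Return (score, i, j) for all column pairs, highest correlation first."""
    columns = list(zip(*alignment))
    results = [
        (correlated_mutation_score(columns[i], columns[j], score), i, j)
        for i, j in itertools.combinations(range(len(columns)), 2)
    ]
    return sorted(results, key=lambda r: r[0], reverse=True)


# Hypothetical toy alignment: columns 1 and 2 covary (K/D versus R/E).
toy_alignment = ["AKDL", "AKDV", "AREL", "AREV"]
ranked = rank_column_pairs(toy_alignment)
```

In this toy case the top-ranked column pair is (1, 2), the only pair whose residues change together. Replacing `identity_score` with a lookup into another substitution matrix changes the similarity measure without altering the rest of the procedure.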