Statistical Analysis of Long-Range Interactions in Proteins

Jinmiao Chen
School of Computer Engineering
Nanyang Technological University
Singapore, 639798
pg05205549@ntu.edu.sg

Narendra S. Chaudhari
School of Computer Engineering
Nanyang Technological University
Singapore, 639798
asnarendra@ntu.edu.sg

Abstract

We carry out a comprehensive study of long-range interactions on a large data set of non-homologous proteins. Our study reveals that long-range interactions between distant amino acids are common in protein folding and play an important role in the formation of secondary structure. Using residue-wise contact order (RWCO) to describe long-range interactions, we further evaluate the effect of long-range interactions on secondary structure prediction. We select six of the most popular prediction methods and collect their prediction results on the same set of proteins. All six prediction methods show a significant negative correlation between prediction accuracy and RWCO.

Key words: Long-range interactions, protein secondary structure, mutual information, residue-wise contact order, correlation coefficient.

1. Introduction

One of the most important open problems in computational biology is the prediction of the secondary structure of a protein given only the underlying amino acid sequence. Over the last few decades, much effort has been devoted to this problem, with approaches including GOR (Garnier, Osguthorpe, and Robson) techniques [8], the PHD (Profile network from HeiDelberg) method [4], nearest-neighbor methods [24], and support vector machines (SVMs) [11]. These methods are all based on a fixed-width window around a particular amino acid of interest, and thus usually do not explicitly consider long-range interactions between distant amino acids. This weakness of local window approaches is often cited as the current limitation to accurate secondary structure prediction.
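To make the fixed-width window idea concrete, the following is a minimal sketch of how a local window might be extracted around each residue. It is an illustration only and is not taken from any of the cited methods: the function name `sliding_windows`, the padding symbol `PAD`, the default width of 13, and the example sequence are all assumptions introduced here.

```python
from typing import List

# Hypothetical padding symbol for window positions beyond the chain termini.
PAD = "-"


def sliding_windows(sequence: str, width: int = 13) -> List[str]:
    """Return one fixed-width window per residue, centred on that residue.

    The window width (an assumed default of 13) must be odd so that the
    residue of interest sits exactly in the middle of its window.
    """
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd integer")
    half = width // 2
    # Pad both ends so that terminal residues also get full-width windows.
    padded = PAD * half + sequence + PAD * half
    return [padded[i:i + width] for i in range(len(sequence))]


# Illustrative sequence only; each window sees at most `half` neighbours per side.
windows = sliding_windows("MKTAYIAKQR", width=5)
```

A predictor built on such windows only sees residues within `width // 2` positions of the central residue, so any contact with a residue farther away along the chain is invisible to it. This is the limitation discussed above.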
In this paper, a comprehensive study is carried out on a large data set of non-homologous proteins. We statistically analyze the long-range interactions in proteins and study their effect on protein secondary structure prediction.

2. Data acquisition

2.1. EVA set

EVA (EValuation of Automatic protein structure prediction) is a web-based server that provides a continuous, fully automated, and statistically significant analysis of structure prediction servers. Every day, EVA downloads the newest protein structures from the Protein Data Bank (PDB) [2]. The structures are added to MySQL databases, sequences are extracted for every protein chain, and these sequences are sent to each prediction server by META-PredictProtein (META-PP) [16]. META-PP collects the results and sends them to EVA. The central EVA site at Columbia is available at http://cubic.bioc.columbia.edu/eva/.

The EVA set, the largest sequence-unique subset of PDB, fulfills the criterion that no pair in the subset has more than 33% identical residues over more than 100 aligned residues. We extracted the EVA set from PDB based on the current EVA list of 3449 chains (http://cubic.bioc.columbia.edu/eva/res/unique_list.html). All proteins in the EVA set are sequence unique, i.e., they do not have significant sequence similarity to proteins of known structure (at the date of the submission of the new protein). The term "significant sequence similarity" refers to the following operational definition: two proteins A and B are considered to be significantly similar if we can safely predict that B has the same structure as A by simply comparing their sequences.

To extract 3D coordinates of Cα atoms, together with secondary structure, we run the DSSP program [15] on all the PDB files of the EVA proteins, excluding those