PROFILES AND FUZZY K-NEAREST NEIGHBOR ALGORITHM FOR PROTEIN SECONDARY STRUCTURE PREDICTION

RAJKUMAR BONDUGULA, OGNEN DUZLEVSKI, AND DONG XU *

Digital Biology Laboratory, Department of Computer Science, University of Missouri-Columbia, Columbia, MO 65211, USA

We introduce a new approach for predicting the secondary structure of proteins using profiles and the Fuzzy K-Nearest Neighbor algorithm. K-Nearest Neighbor methods perform relatively better than Neural Networks or Hidden Markov Models when the query protein has few homologs in the sequence database from which to build a sequence profile. Although traditional K-Nearest Neighbor algorithms are a good choice in this situation, they pose two difficulties: all labeled samples are given equal importance when deciding the secondary structure class of a protein residue, and once a class has been assigned to a residue, there is no indication of the confidence in that assignment. In this paper, we propose a system based on the Fuzzy K-Nearest Neighbor algorithm that addresses both issues and outperforms earlier K-Nearest Neighbor methods that use multiple sequence alignments. We also introduce a new distance measure between two protein sequences and a new method for assigning membership values to the nearest neighbors in each of the Helix, Strand, and Coil classes. In addition, we propose a novel heuristic-based filter to smooth the predictions. A particularly attractive feature of our filter is that it does not require retraining when new structures are added to the database. We have achieved a sustained three-state overall accuracy of 75.75% with our system. The software is available upon request.

1 Introduction

The ability to predict the secondary structure of a protein from sequence alone is an important step in understanding the three-dimensional structure and the function of a protein.
Owing to the importance of protein secondary structure prediction, much attention has been given to this problem [4, 6-12, 14, 16]. Of all the successful prediction methods, the most popular systems are based on Neural Network methods [16], Nearest Neighbor methods [7, 10], and Hidden Markov Model methods [14]. Currently, the systems based on Neural Network methods are among the most accurate of all prediction systems [16]. However, Neural Network methods have some drawbacks. Firstly, the black-box nature of Neural Networks makes it difficult to understand how the networks predict the structure. Secondly, the systems based on Neural Network methods and Hidden Markov Models perform well only if the query protein has many homologs in the database [6-7]. On the other hand, the prediction systems based on Nearest Neighbor methods do not suffer from either of the above-mentioned drawbacks [10].

* Corresponding author. Dong Xu can be contacted at dong@cecs.missouri.edu.

Also, the Nearest Neighbor methods are sub-optimal methods and the 1-NN rule is bounded above by no more than