Automatic Protein Structure Classification through Structural Fingerprinting

Zeyar Aung, Kian-Lee Tan
Department of Computer Science
National University of Singapore
3 Science Drive 2, Singapore 117543
Email: {zeyaraun, tankl}@comp.nus.edu.sg

Abstract

In this paper, we present a new scheme named “CPMine” for automatic three-dimensional (3D) protein structure classification using structural fingerprints. We represent a 3D protein structure as a CPset, which is a set of inter-SSE contact patterns (CPs) existing in the protein. Suppose we have a database of protein structures whose class labels are already known, and suppose that the database contains distinct protein structure classes. For each class, we generate its fingerprint by mining the frequent CPsets from all the member protein structures belonging to this class. To predict the class label of an unknown protein, we generate the CPset of this protein and compute the intersection between this CPset and the fingerprint of each protein structure class, one by one. The labels of the classes with the highest degree of intersection are then returned as the answer. The proposed method is a pure classification scheme in that no structural comparison, alignment, or searching needs to be performed. The preliminary experimental results show that our method can classify protein structures accurately and efficiently.

1. Introduction

Three-dimensional (3D) protein structure analysis is an important area in bioinformatics. It involves such tasks as protein structure comparison, classification, modelling, and prediction. Researchers have developed numerous automated methods for structural comparison, modelling, and prediction over the last decade. However, only a few automated structural classification methods have been proposed so far.
Instead, people rely on manual and semi-automatic classification methods such as SCOP [12] and CATH [15]. Alternatively, they use traditional structural comparison or alignment methods such as DALI [10] and VAST [7] for classification purposes.

Because of advancements in the laboratory methods used to determine the 3D structures of proteins, protein structure databases such as PDB [2] are growing very rapidly. Nowadays, new proteins are added to PDB every day. Thus, determining the respective structural classes¹ of these new proteins by manual or semi-automatic methods becomes inadequate. Classifying a new protein through traditional structural comparison also becomes inefficient because it involves comparing the new protein against all the proteins (or the representative set) in the large database. Although faster database searching methods such as ProtDex [1] are available, they are not designed specifically for classification, and thus still require considerable time to search through the entire database.

Our objective is to develop an efficient, dedicated, and automated protein structure classifier that does not need to perform any structural comparison or searching. We use data mining techniques to generate the structural fingerprints for the various structural classes. A fingerprint is a set of concise and representative features that are common to all the protein structures in a given class. When we want to classify an unknown protein structure, we can use these fingerprints as indicators for efficient and effective classification.

Our proposed CPMine scheme can be briefly described as follows. Suppose we have a database of 3D protein structures (in PDB format) whose class labels are already known. We use SCOP as the gold standard in assigning the class labels to the proteins.
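As a rough illustration of the scheme summarized in the abstract, the sketch below mines one fingerprint per class and then classifies a query CPset by its overlap with each fingerprint. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `mine_fingerprints` and `classify`, the `min_support` parameter, and the contact-pattern tokens such as "HH-a" are all hypothetical. It also simplifies frequent-CPset mining to counting individual CPs, and it assumes the degree of intersection is the fraction of a fingerprint covered by the query.

```python
from collections import Counter, defaultdict


def mine_fingerprints(database, min_support=0.6):
    """Build one fingerprint per class from (class_label, cpset) pairs.

    Simplifying assumption: a fingerprint is the set of individual CPs
    that occur in at least `min_support` of the class's member proteins.
    """
    by_class = defaultdict(list)
    for label, cpset in database:
        by_class[label].append(frozenset(cpset))

    fingerprints = {}
    for label, cpsets in by_class.items():
        # Count in how many member proteins each CP occurs.
        counts = Counter(cp for cpset in cpsets for cp in cpset)
        threshold = min_support * len(cpsets)
        fingerprints[label] = {cp for cp, n in counts.items() if n >= threshold}
    return fingerprints


def classify(cpset, fingerprints):
    """Return the labels whose fingerprints overlap most with `cpset`.

    Assumed scoring: fraction of the fingerprint found in the query CPset.
    Ties are all returned; no overlap at all yields an empty list.
    """
    query = set(cpset)
    scores = {
        label: (len(fp & query) / len(fp) if fp else 0.0)
        for label, fp in fingerprints.items()
    }
    best = max(scores.values(), default=0.0)
    if best == 0.0:
        return []
    return sorted(label for label, score in scores.items() if score == best)


# Toy database with hypothetical contact-pattern tokens.
toy_db = [
    ("all-alpha", {"HH-a", "HH-b", "HH-c"}),
    ("all-alpha", {"HH-a", "HH-b"}),
    ("all-beta", {"EE-a", "EE-b"}),
    ("all-beta", {"EE-a", "EE-b", "HH-a"}),
]
fingerprints = mine_fingerprints(toy_db, min_support=1.0)
predicted = classify({"HH-a", "HH-b", "EE-a"}, fingerprints)
```

Note that classification touches only one fingerprint per class rather than every protein in the database, which is the efficiency argument made above.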
A class label of a protein is its SCOP designation (Class, Fold, Superfamily, or Family), depending on the level of detail required. We represent each protein structure in the database as a CPset.

¹ The term ‘class’ in the machine learning literature refers to a group in general. It should not be confused with the term ‘Class’ in the SCOP hierarchy.

Proceedings of the Fourth IEEE Symposium on Bioinformatics and Bioengineering (BIBE’04) 0-7695-2173-8/04 $ 20.00 © 2004 IEEE

A CPset is a set of