Automatic Protein Structure Classification through Structural Fingerprinting

Zeyar Aung, Kian-Lee Tan
Department of Computer Science
National University of Singapore
3 Science Drive 2, Singapore 117543
Email: {zeyaraun, tankl}@comp.nus.edu.sg

Abstract

In this paper, we present a new scheme named "CPMine" for automatic three-dimensional (3D) protein structure classification using structural fingerprints. We represent a 3D protein structure as a CPset, which is a set of inter-SSE contact patterns (CPs) existing in the protein. Suppose we have a database of protein structures whose class labels are already known, and suppose there are distinct protein structure classes in the database. For each class, we generate its fingerprint by mining the frequent CPsets from all the member protein structures belonging to this class. To predict the class label of an unknown protein, we generate the CPset of this protein and compute the intersection between this CPset and the fingerprint of each protein structure class in turn. The labels of the classes with the highest degree of intersection are then returned as the answer. The proposed method is a pure classification scheme in that no structural comparison, alignment, or searching needs to be performed. Preliminary experimental results show that our method can classify protein structures accurately and efficiently.

1. Introduction

Three-dimensional (3D) protein structure analysis is an important area in bioinformatics. It involves tasks such as protein structure comparison, classification, modelling, and prediction. Over the last decade, researchers have developed numerous automated methods for structural comparison, modelling, and prediction, but only a few automated structural classification methods have been proposed so far.
Instead, people rely on manual and semi-automatic classification methods such as SCOP [12] and CATH [15], or they use traditional structural comparison or alignment methods such as DALI [10] and VAST [7] for classification purposes.

Because of advances in laboratory methods for determining the 3D structures of proteins, protein structure databases such as PDB [2] are growing very rapidly. Nowadays, new proteins are added to PDB every day. Thus, determining the structural classes¹ of these new proteins by manual or semi-automatic methods is becoming inadequate. Classifying a new protein through traditional structural comparison is also becoming inefficient, because it involves comparing the new protein against all the proteins (or a representative set) in the large database. Although faster database searching methods such as ProtDex [1] are available, they are not designed specifically for classification and thus still require considerable time to search through the entire database.

Our objective is to develop an efficient, dedicated, and automated protein structure classifier that does not need to perform any structural comparison or searching. We use data mining techniques to generate structural fingerprints for the various structural classes. A fingerprint is a set of concise and representative features that are common to all the protein structures in a given class. When we want to classify an unknown protein structure, we can use these fingerprints as indicators for efficient and effective classification.

Our proposed CPMine scheme can be briefly described as follows. Suppose we have a database of 3D protein structures (in PDB format) whose class labels are already known. We use SCOP as the gold standard for assigning class labels to the proteins. The class label of a protein is its SCOP designation (Class, Fold, Superfamily, or Family), depending on the level of detail required.
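The fingerprint idea described above can be illustrated with a minimal Python sketch. This is not the CPMine implementation: contact patterns are treated as opaque string tokens, the fingerprint is approximated by the individual CPs that meet a support threshold (a simplification of frequent-set mining), and the "degree of intersection" is assumed to be the fraction of a fingerprint's CPs found in the query. The function names, the `min_support` and `top_k` parameters, and the toy class labels are all hypothetical.

```python
from collections import Counter

def mine_fingerprint(cpsets, min_support=1.0):
    """Return the CPs present in at least `min_support` of the class members.

    Assumption: single frequent CPs stand in for mined frequent CPsets.
    """
    if not cpsets:
        return frozenset()
    # Count each CP once per member protein.
    counts = Counter(cp for cpset in cpsets for cp in set(cpset))
    threshold = min_support * len(cpsets)
    return frozenset(cp for cp, n in counts.items() if n >= threshold)

def intersection_degree(cpset, fingerprint):
    """Fraction of the fingerprint's CPs that also occur in `cpset` (assumed measure)."""
    if not fingerprint:
        return 0.0
    return len(set(cpset) & fingerprint) / len(fingerprint)

def classify(cpset, fingerprints, top_k=1):
    """Rank class labels by intersection degree and return the best `top_k`."""
    ranked = sorted(
        fingerprints,
        key=lambda label: (-intersection_degree(cpset, fingerprints[label]), label),
    )
    return ranked[:top_k]

# Toy training data (hypothetical): class label -> CPsets of member proteins.
training = {
    "all-alpha": [{"cp1", "cp2", "cp3"}, {"cp1", "cp2", "cp4"}],
    "all-beta": [{"cp5", "cp6"}, {"cp5", "cp6", "cp1"}],
}
fingerprints = {
    label: mine_fingerprint(members, min_support=1.0)
    for label, members in training.items()
}

# Classify an unknown protein by comparing its CPset with each fingerprint only.
query = {"cp1", "cp2", "cp9"}
predicted = classify(query, fingerprints, top_k=1)
print(predicted)
```

Note that the query is compared only against one fingerprint per class, never against the individual database proteins, which is the source of the efficiency claimed for this kind of scheme. Lowering `min_support` below 1.0 would relax the requirement that a feature be common to every member of a class.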
¹ The term 'class' in the machine learning literature refers to a group in general. It should not be confused with the term 'Class' in the SCOP hierarchy.

Proceedings of the Fourth IEEE Symposium on Bioinformatics and Bioengineering (BIBE'04) 0-7695-2173-8/04 $20.00 © 2004 IEEE

We represent each protein structure in the database as a CPset. A CPset is a set of