Automatic Protein Structure Classification through
Structural Fingerprinting
Zeyar Aung, Kian-Lee Tan
Department of Computer Science
National University of Singapore
3 Science Drive 2, Singapore 117543
Email: {zeyaraun, tankl}@comp.nus.edu.sg
Abstract
In this paper, we present a new scheme named “CPMine” for
automatic three-dimensional (3D) protein structure classification
using structural fingerprints. We represent a 3D protein structure
as a CPset, which is a set of contact patterns (CPs) between the
secondary structure elements (SSEs) of the protein. Suppose we have
a database of protein structures whose class labels are already
known, and that the database contains a number of distinct
protein structure classes. For each class, we
generate its fingerprint by mining the frequent CPsets from
all the member protein structures belonging to this class.
When we want to predict the class label of an unknown pro-
tein, we also generate the CPset of this protein, and find
the intersection between this CPset and the fingerprint of
each protein structure class one by one. Then, the labels of
the classes with the highest degree of intersection are
returned as the answer. The proposed method is a pure
classification scheme in that no structural comparison,
alignment, or searching needs to be performed.
Preliminary experimental results show that our method can
classify protein structures accurately and efficiently.
1. Introduction
Three-dimensional (3D) protein structure analysis is an
important area in bioinformatics. It involves such tasks as
protein structure comparison, classification, modelling, and
prediction. Over the last decade, researchers have developed
numerous automated methods for structural comparison, modelling,
and prediction. However, only a few automated structural
classification methods have been proposed so far. Instead,
people rely on manual and semi-automatic classification
schemes such as SCOP [12] and CATH [15], or they use
traditional structural comparison or alignment methods such as
DALI [10] and VAST [7] for classification purposes.
Because of the advancements in the laboratory methods
to determine the 3D structures of the proteins, the sizes of
the protein structure databases such as PDB [2] are growing
at very rapid rates, with new proteins being added to PDB
every day. Manual and semi-automatic methods are therefore
no longer adequate for determining the respective structural
classes¹ of these new proteins. Classifying a new protein
through traditional structural comparison is also inefficient
because the new protein must be compared against all the
proteins (or a representative set of them) in the large
database. Although faster database searching methods such as
ProtDex [1] are available, they are not designed specifically
for classification, and thus still require considerable time
to search through the entire database.
Our objective is to develop an efficient, dedicated and au-
tomated protein structure classifier without having to per-
form any structural comparison or searching. We use data
mining techniques to generate the structural fingerprints for
the various structural classes. A fingerprint is a set of con-
cise and representative features that are common to all the
protein structures in a given class. When we want to clas-
sify an unknown protein structure, we can use these finger-
prints as the indicators for efficient and effective classifica-
tion.
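As an illustration, the fingerprinting idea can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in this paper: each CP is assumed to be already encoded as a hashable label (the strings in the usage note are hypothetical), a fingerprint is approximated by the individual CPs that occur in at least a `min_support` fraction of the member CPsets rather than by fully mined frequent itemsets, and the "degree of intersection" is assumed to be the overlap size normalized by the fingerprint size. The function names and the `min_support` parameter are introduced here for illustration only.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, List, Set, Tuple

CP = Hashable  # assumption: a contact pattern is encoded as any hashable label


def mine_fingerprint(cpsets: Iterable[Set[CP]], min_support: float = 0.5) -> Set[CP]:
    """Return the CPs found in at least `min_support` of the given CPsets."""
    cpsets = list(cpsets)
    if not cpsets:
        return set()
    # Count, for every CP, how many member structures contain it.
    counts = Counter(cp for cpset in cpsets for cp in set(cpset))
    threshold = min_support * len(cpsets)
    return {cp for cp, n in counts.items() if n >= threshold}


def build_fingerprints(
    labelled: Iterable[Tuple[str, Set[CP]]], min_support: float = 0.5
) -> Dict[str, Set[CP]]:
    """Group labelled CPsets by class and mine one fingerprint per class."""
    by_class: Dict[str, List[Set[CP]]] = {}
    for label, cpset in labelled:
        by_class.setdefault(label, []).append(cpset)
    return {
        label: mine_fingerprint(members, min_support)
        for label, members in by_class.items()
    }


def classify(query: Set[CP], fingerprints: Dict[str, Set[CP]]) -> List[str]:
    """Return the label(s) whose fingerprint overlaps the query CPset most."""
    scores = {
        # Assumed score: fraction of the fingerprint that the query contains.
        label: (len(query & fp) / len(fp) if fp else 0.0)
        for label, fp in fingerprints.items()
    }
    best = max(scores.values(), default=0.0)
    if best == 0.0:
        return []  # the query shares nothing with any fingerprint
    return sorted(label for label, score in scores.items() if score == best)
```

For example, with two hypothetical classes built from `[("alpha", {"HH1", "HH2", "HE1"}), ("alpha", {"HH1", "HH2"}), ("beta", {"EE1", "EE2"}), ("beta", {"EE1", "EE2", "HE1"})]` and `min_support=1.0`, a query CPset `{"HH1", "HH2", "EE1"}` contains the whole "alpha" fingerprint but only half of the "beta" fingerprint. Note that classification touches one fingerprint per class, not every structure in the database.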
Our proposed CPMine scheme can be briefly described
as follows. Suppose we have a database of 3D protein struc-
tures (in PDB format) whose class labels are already known.
We use SCOP as the gold standard in assigning the class
labels to the proteins. A class label of a protein is its SCOP
designation – Class, Fold, Superfamily, or Family – depend-
ing on the level of detail required. We represent each pro-
tein structure in the database as a CPset. A CPset is a set of
1 The term ‘class’ in the machine learning literature refers to a group
in general. It should not be confused with the term ‘Class’ in the SCOP
hierarchy.
Proceedings of the Fourth IEEE Symposium on Bioinformatics and Bioengineering (BIBE’04)
0-7695-2173-8/04 $ 20.00 © 2004 IEEE