A comprehensive system for identifying internal repeat
substructures of proteins
Hua-Ying Kao, Tsang-Huang Shih, Tun-Wen Pai
Dept. of Computer Science and Engineering
National Taiwan Ocean University,
Keelung, Taiwan, R.O.C.
e-mail: twp@mail.ntou.edu.tw
Ming-Da Lu, Hui-Huang Hsu
Dept. of Computer Science and
Information Engineering,
Tamkang University, Taipei, Taiwan, R.O.C.
e-mail: hhsu@cs.tku.edu.tw
Abstract—Repetitive substructures within a protein play an
important role in understanding protein folding and stability,
biological function, and genome evolution. In eukaryotic species,
about 25% of all proteins contain repeat structures, and most of
them do not yet have resolved structural information.
Therefore, this study aimed to design a comprehensive system for
identifying internal repeats from either protein sequence or
structural information. In this study, we have curated a set of
internal repeat units as a benchmark dataset for performing both
sequence and structural alignment with respect to the query
sequence or structure. In addition to the traditional BLAST
algorithms on amino acid sequences and the optimal structural
superposition approaches on structures, a novel method
employing predicted secondary structure element information
for internal repeat identification is proposed. Sequences are
first transformed into Length Encoded Secondary Structure
(LESS) profiles, which are then analyzed by autocorrelation. From
the preliminary experimental results, the developed Internal Repeat
Identification System (IRIS) can successfully identify internal
repeats from known protein structures, and the web system
is freely available at http://iris.cs.ntou.edu.tw/.
Keywords- internal repeat unit; secondary structure element;
sequence alignment; structure alignment; Length Encoded
Secondary Structure; solenoid
I. INTRODUCTION
Protein repeats are roughly classified into three
different types according to the length of the repeat unit. The
shortest repeats contain fewer than 4 residues per repeat
unit and form crystalline or fibrous structures. The second
type has a unit length shorter than 45 residues; these are
called solenoid proteins, in which each repeat unit contains
secondary structure elements and the repeat units are coiled
sequentially along a common axis or in a specific direction
in space. In the third type, the basic repeat unit is longer
than 45 residues and forms a protein domain by itself within
the repeat structure[1]. Due to the important biological features
and particular structural frameworks of proteins
with internal repeats, increasing interest has recently been
devoted to the detection of protein repeats. A
repetitive substructure within a protein plays an important
role in understanding protein folding and stability,
biological function, and genome evolution [2] [3] [4]. These
studies indicated that novel genes evolved through
duplications and transitions from existing genes within
proteins possessing regular secondary structures and
functional units[5], and that the stability and repetition of
structural units directly reflect the structural and
biophysical properties of proteins[6]. For example, different
alleles of the fungus Podospora anserina possess different
numbers of WD40 (WD or beta-transducin repeat)
repeats[7]. Analysis of the conserved cores of internal
repeats often reveals symmetric units in the structures, such as
protein phosphatase 2A PR65 (HEAT), a superhelix with
repeats [8].
Over the last two decades, a number of tools have been
developed for detecting repeats in sequences and structures.
Several implementations were designed at the DNA level,
such as Reputer[9], CGSSR[10], and Repseek[11], while
Swelfe[12], REPRO[13], and REPETITA[14] operate at
the structure level. Traditional approaches for identifying
internal repeats are based on sequence alignment strategies,
especially when dealing with protein sequences without
resolved protein structures. Predicting the internal repeats
within a protein is not easy, since identical substructures
within a protein usually have highly varied residue
content. However, the task becomes
relatively simple when the protein structure is known and its
repetitive composition can be analyzed quantitatively.
Sequence alignment approaches are satisfactory
only for sequences with high similarity and
regularity. Such alignment tools, combined with
comprehensive genomic databases of various species, can be
efficiently applied to identify homologous sequences that
carry known repeat information. These well-known
tools include PAM [15], BLAST[16],
PSI-BLAST[17], and ClustalW[18]. However, if protein
segments possess low sequence similarity, then sequence
alignment based methods become ineffective for internal repeat
detection. Unfortunately, it has been verified that the sequence
contents of the repeat structure units of most proteins with
internal repeats are highly divergent across
species. Hence, secondary structure
information provides an alternative way to analyze
and predict the locations of repeat segments within a protein,
since the secondary structures always possess highly
2010 International Conference on Complex, Intelligent and Software Intensive Systems
978-0-7695-3967-6/10 $26.00 © 2010 IEEE
DOI 10.1109/CISIS.2010.92