A comprehensive system for identifying internal repeat substructures of proteins

Hua-Ying Kao, Tsang-Huang Shih, Tun-Wen Pai
Dept. of Computer Science and Engineering, National Taiwan Ocean University, Keelung, Taiwan, R.O.C.
e-mail: twp@mail.ntou.edu.tw

Ming-Da Lu, Hui-Huang Hsu
Dept. of Computer Science and Information Engineering, Tamkang University, Taipei, Taiwan, R.O.C.
e-mail: hhsu@cs.tku.edu.tw

Abstract—Repetitive substructures within a protein play an important role in understanding protein folding and stability, biological function, and genome evolution. In eukaryotic species, about 25% of all proteins contain repeat structures, and most of them do not yet have resolved structural information. Therefore, this study aimed to design a comprehensive system for identifying internal repeats from either protein sequence or structural information. We curated a set of internal repeat units as a benchmark dataset for performing both sequence and structural alignment against the query sequence or structure. In addition to the traditional BLAST algorithms on amino acid sequences and the optimal structural superposition approaches on structures, a novel method that employs predicted secondary structure element information for internal repeat identification is proposed. Sequences are first transformed into Length Encoded Secondary Structure (LESS) profiles, which are then subjected to autocorrelation analysis. Preliminary experimental results show that the developed Internal Repeat Identification System (IRIS) can successfully identify internal repeats in known protein structures. The web system is freely available at http://iris.cs.ntou.edu.tw/ .

Keywords- internal repeat unit; secondary structure element; sequence alignment; structure alignment; Length Encoded Secondary Structure; solenoid

I. INTRODUCTION

Protein repeats are roughly classified into three types according to the length of the repeat unit.
The shortest repeats contain fewer than 4 residues per repeat unit and form crystalline or fibrous structures. The second type has a unit length shorter than 45 residues. These proteins are called solenoid proteins: each repeat unit contains secondary structure elements, and the units are coiled sequentially along a common axis or in a specific direction in space. The third type has a basic repeat unit longer than 45 residues, which forms a protein domain by itself within the repeat structure [1].

Because proteins with internal repeats have important biological features and particular structural frameworks, the detection of protein repeats has recently attracted increasing interest. A repetitive substructure within a protein plays an important role in understanding protein folding and stability, biological function, and genome evolution [2] [3] [4]. These studies indicated that novel genes evolved through duplications and transitions of existing genes within proteins possessing regular secondary structures and functional units [5], and that the stability and repetition of structural units directly reflect the structural and biophysical properties of proteins [6]. For example, different alleles of the fungus Podospora anserina possess different numbers of WD40 (WD or beta-transducin) repeats [7]. Analysis of the conserved cores of internal repeats often reveals symmetric units in the structures, such as protein phosphatase 2A PR65 (HEAT), a superhelix with repeats [8].

Over the last two decades, a number of tools have been developed for detecting repeats in sequences and structures. Several implementations were designed at the DNA level, such as Reputer [9], CGSSR [10], and Repseek [11], while Swelfe [12], REPRO [13], and REPETITA [14] operate at the structure level.
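The three length-based repeat types described at the start of this section can be written as a small lookup. This is a minimal sketch and not part of IRIS: the names `RepeatClass` and `classify_repeat_unit` are hypothetical, and because the text leaves a unit length of exactly 45 residues unassigned, grouping it with the domain-forming type is an assumption made here.

```python
from enum import Enum


class RepeatClass(Enum):
    """Hypothetical labels for the three length-based repeat types."""
    FIBROUS = "crystalline or fibrous repeat"
    SOLENOID = "solenoid repeat"
    DOMAIN = "domain-forming repeat"


def classify_repeat_unit(unit_length: int) -> RepeatClass:
    """Map a repeat-unit length (in residues) to one of the three types."""
    if unit_length < 1:
        raise ValueError("unit_length must be a positive number of residues")
    if unit_length < 4:
        # Fewer than 4 residues: crystalline or fibrous structures.
        return RepeatClass.FIBROUS
    if unit_length < 45:
        # Shorter than 45 residues: solenoid proteins.
        return RepeatClass.SOLENOID
    # Longer than 45 residues: the unit forms a domain by itself.
    # Assumption: a length of exactly 45 is also placed here.
    return RepeatClass.DOMAIN


if __name__ == "__main__":
    for length in (3, 30, 120):
        print(length, classify_repeat_unit(length).value)
```

Length alone only suggests a type. Whether a unit really forms a solenoid or an independent domain still depends on its secondary structure content and spatial arrangement, as described above.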
Traditional approaches for identifying internal repeats are based on sequence alignment strategies, especially for protein sequences without resolved structures. Predicting the internal repeats within a protein is not easy, because identical substructures within a protein usually have highly varied residue content. However, the task becomes relatively simple when the protein structure is known and its repetitive composition can be analyzed quantitatively. Sequence alignment approaches are satisfactory only for sequences with high similarity and regularity. Combined with comprehensive genomic databases of various species, such alignment tools can be applied efficiently to identify homologous sequences that have known repeat information. These well-known tools include PAM [15], BLAST [16], PSI-BLAST [17], and ClustalW [18]. However, if protein segments have low sequence similarity, sequence alignment based methods become ineffective for internal repeat detection. Unfortunately, it has been verified that, for most proteins with internal repeats, the sequence content of the repeat structure units is highly divergent across species. Hence, secondary structure information provides an alternative way to analyze and predict the locations of repeat segments within a protein, since the secondary structures always possess highly
2010 International Conference on Complex, Intelligent and Software Intensive Systems 978-0-7695-3967-6/10 $26.00 © 2010 IEEE DOI 10.1109/CISIS.2010.92 689