978-1-4244-2900-4/08/$25.00 ©2008 IEEE ICIAFS08
Classification of Protein Sequences using the Growing Self-Organizing Map
Norashikin Ahmad, Damminda Alahakoon, Rowena Chau
Clayton School of Information Technology, Monash University, Clayton, Victoria, Australia
Email: {norashikin.ahmad, damminda.alahakoon, rowena.chau}@infotech.monash.edu.au
Abstract— Protein sequence analysis is an important task in bioinformatics. Classifying protein sequences into groups supports further analysis of the structures and roles of a particular group of proteins in biological processes. It also allows an unknown or newly found sequence to be identified by comparing it with protein groups that have already been studied. In this paper, we present the use of the Growing Self-Organizing Map (GSOM), an extended version of the Self-Organizing Map (SOM), for classifying protein sequences. With its dynamic structure, GSOM facilitates the discovery of knowledge in a more natural way. This study focuses on two aspects: the effect of the GSOM spread factor parameter on node growth, and the identification of groups and subgroups at different levels of abstraction using the spread factor.
Keywords— protein sequence, classification, clustering, self-organizing map
I. INTRODUCTION
The Human Genome Project [1] has resulted in a rapid increase of biological data, including protein sequences, in biological databases. This has created a need for effective computational tools for analyzing very large amounts of data. Being the product of molecular evolution, protein sequences carry a great deal of information. Highly similar sequences have typically diverged from a common ancestor, and they usually have similar structures and perform the same roles in biological processes.
The fundamental methods used in sequence analysis to identify similarities between protein sequences are pair-wise sequence comparison, which compares two sequences, and multiple sequence alignment. The earliest methods developed for pair-wise comparison are the dynamic programming algorithms of Needleman and Wunsch [2] (global alignment) and Smith and Waterman [3] (local alignment). Dynamic programming is computationally expensive and can handle only a small number of sequences. The FASTA [4] and BLAST [5] algorithms, which employ heuristic techniques, were developed to overcome this problem; they are faster but less accurate than dynamic programming methods. Multiple sequence alignment, on the other hand, is used to identify conserved motifs by aligning a set of related or homologous sequences. From this alignment, a consensus pattern that characterizes a protein group or family can be discovered. This method has been used as a basis for classifying protein sequences into families in many secondary databases such as PROSITE [6] (which uses regular expression patterns) and Pfam [7] (which uses Hidden Markov Models).
Classification of protein sequences into groups or families is beneficial as it enables further analysis to be carried out within a group. Identifying properties of a new sequence, such as its possible structure and function, is also made easier by comparing it with existing groups that have already been studied.
Artificial neural networks have been widely used to solve problems in many areas, including protein sequence classification [8-12]. Unsupervised neural networks such as the Self-Organizing Map (SOM) [13] have an advantage over supervised methods in that they do not require labeled examples in the learning process. The SOM can also construct a non-linear projection of complex, high-dimensional input signals onto a low-dimensional map, which at the same time provides a visualization of the cluster grouping.
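As a rough illustration of the projection described above, the following is a minimal sketch of a fixed-size SOM in plain Python. It is not the implementation used in this paper: the grid size, the linearly decaying learning rate and radius, the Gaussian neighbourhood, and the function names `train_som` and `best_matching_unit` are all assumptions chosen for illustration.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid coordinate of the node whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda node: sum((w - v) ** 2 for w, v in zip(weights[node], x)),
    )


def train_som(data, rows=3, cols=3, epochs=50, lr0=0.5, radius0=1.5, seed=0):
    """Train a small fixed-size SOM (illustrative; all defaults are assumptions)."""
    rng = random.Random(seed)
    dim = len(data[0])
    # One weight vector per grid node, initialised at random in [0, 1).
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    for epoch in range(epochs):
        # Learning rate and neighbourhood radius decay linearly over the epochs.
        frac = 1.0 - epoch / epochs
        lr = lr0 * frac
        radius = max(radius0 * frac, 0.5)
        for x in data:
            br, bc = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                # Gaussian neighbourhood on the 2-D grid, centred on the winner.
                grid_dist2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-grid_dist2 / (2.0 * radius ** 2))
                # Move the node's weight vector towards the input.
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
    return weights
```

After training, calling `best_matching_unit(weights, x)` maps a high-dimensional input vector `x` to a coordinate on the two-dimensional grid; inputs that are similar would be expected to land on the same or neighbouring nodes, which is what makes the map usable for visualizing clusters.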
These properties have made the SOM a very useful tool in biological data analysis and discovery.
This paper introduces the use of the Growing Self-Organizing Map (GSOM) [14], a SOM-based algorithm, for classifying protein sequences. Unlike the SOM, which has a fixed structure, the GSOM can grow nodes to better represent the discovered patterns. The spread factor (SF) parameter controls the growth or spread of the map, giving an analyst the flexibility to analyze the resulting clusters at different granularities. GSOM has been shown to be effective in pattern discovery on biological and biomedical data such as leukemia gene expression [15], sleep apnea and dermatology [16], and DNA sequence fragments [17]. In this paper, classification of protein sequences has been carried out and the growth behavior of the GSOM across spread factors was investigated. The formation of the groups and subgroups