978-1-4244-2900-4/08/$25.00 ©2008 IEEE ICIAFS08
Classification of Protein Sequences using the
Growing Self-Organizing Map
Norashikin Ahmad, Damminda Alahakoon, Rowena Chau
Clayton School of Information Technology
Monash University
Clayton, Victoria, Australia
Email: {norashikin.ahmad, damminda.alahakoon, rowena.chau}@infotech.monash.edu.au
Abstract— Protein sequence analysis is an important task in
bioinformatics. The classification of protein sequences into
groups is beneficial for further analysis of the structures and
roles of a particular group of proteins in biological processes. It also
allows an unknown or newly found sequence to be identified by
comparing it with protein groups that have already been studied.
In this paper, we present the use of the Growing Self-Organizing
Map (GSOM), an extended version of the Self-Organizing Map
(SOM), in classifying protein sequences. With its dynamic
structure, GSOM facilitates the discovery of knowledge in a more
natural way. This study focuses on two aspects: analysis of the
effect of the GSOM spread factor parameter on node growth,
and the identification of groups and subgroups at different
levels of abstraction by using the spread factor.
Keywords— protein sequence, classification, clustering, self-
organizing map
I. INTRODUCTION
The Human Genome Project [1] has resulted in a rapid increase
of biological data, including protein sequences, in
biological databases. This situation has led to the need for
effective computational tools for analyzing
very large amounts of data.
Being the product of molecular evolution, protein sequences
provide a lot of information. Sequences that are highly
similar are likely to have diverged from a common ancestor, and they
usually have similar structures and perform the same roles in
biological processes. The fundamental methods used in
sequence analysis to identify similarities between protein
sequences are pair-wise sequence comparison for comparing
two sequences and multiple sequence alignment. The earliest
methods developed for pair-wise comparison are the dynamic
programming algorithms of Needleman and Wunsch [2]
(global alignment) and Smith and Waterman [3] (local
alignment). Dynamic programming is computationally
expensive and can handle only a small number of sequences.
The FASTA [4] and BLAST [5] algorithms, which employ
heuristic techniques, were developed to overcome this
problem; they are faster but less accurate than dynamic
programming methods. On the other hand, the multiple
sequence alignment method is used to identify conserved
motifs by aligning together a set of related or homologous
sequences. From this alignment, a consensus pattern that
characterizes a protein group or family can be discovered. This
method has been utilized as a basis in classifying protein
sequences into families in many secondary databases such as
PROSITE [6] (which uses regular expression patterns) and Pfam [7]
(which uses Hidden Markov Models).
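As an illustration of the dynamic programming approach to pair-wise comparison mentioned above, the following minimal Python sketch computes a Needleman-Wunsch style global alignment score. The function name and the uniform match, mismatch and gap scores are assumptions made for illustration only; practical protein alignment typically relies on a substitution matrix and more elaborate gap penalties. For two sequences of lengths m and n the table has (m+1)(n+1) cells, which is why the approach becomes costly when many sequences must be compared.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of sequences a and b by dynamic programming.

    The scoring values are illustrative assumptions, not taken from the paper.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty sequence costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = score[i - 1][j - 1] + substitution  # align a[i-1] with b[j-1]
            up = score[i - 1][j] + gap                     # gap inserted in b
            left = score[i][j - 1] + gap                   # gap inserted in a
            score[i][j] = max(diagonal, up, left)

    return score[-1][-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("AC", "AGC"))
```

Only the score is returned here; recovering the alignment itself would additionally require a traceback through the table.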
Classification of protein sequences into groups or families is
beneficial as it enables further analysis to be made within a
group. Identifying properties of a new sequence, such as its possible
structure and function, is also made easier by comparing it
with existing groups that have already been studied.
Artificial neural networks have been widely used in solving
problems in many areas including protein sequence
classification [8-12]. Unsupervised neural networks such
as the Self-Organizing Map (SOM) [13] have some advantages
over supervised methods, as they do not require labeled examples in
their learning process. The SOM can also construct a non-linear
projection of complex, high-dimensional input signals onto a
low-dimensional map, which at the same time provides a
visualization of the cluster grouping. These properties have
made the SOM a very useful tool in biological data analysis and
discovery.
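The projection described above can be illustrated by a single SOM training step: the node whose weight vector is closest to the input (the best matching unit) is located, and it and its grid neighbors are moved toward the input. The sketch below is a generic SOM update with a Gaussian neighborhood, given under stated assumptions; it is not the implementation used in this study, and the function names, learning rate and radius are hypothetical.

```python
import math


def best_matching_unit(weights, x):
    """Return the grid position of the node whose weight vector is closest to x."""
    best_pos, best_dist = None, float("inf")
    for pos, w in weights.items():
        dist = sum((wi - xi) ** 2 for wi, xi in zip(w, x))  # squared Euclidean distance
        if dist < best_dist:
            best_pos, best_dist = pos, dist
    return best_pos


def som_update(weights, x, learning_rate=0.5, radius=1.0):
    """Apply one SOM training step for input x and return the best matching unit.

    weights maps (row, col) grid positions to weight vectors (lists of floats).
    learning_rate and radius are illustrative values only.
    """
    bmu = best_matching_unit(weights, x)
    for pos, w in weights.items():
        grid_dist_sq = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
        # Gaussian neighborhood: nodes near the BMU on the grid move more.
        influence = math.exp(-grid_dist_sq / (2 * radius ** 2))
        weights[pos] = [wi + learning_rate * influence * (xi - wi)
                        for wi, xi in zip(w, x)]
    return bmu


if __name__ == "__main__":
    grid = {(0, 0): [0.0, 0.0], (0, 1): [1.0, 1.0]}
    print(som_update(grid, [0.9, 0.9]), grid)
```

Because neighboring nodes are updated together, inputs that are similar in the high-dimensional space tend to be mapped to nearby positions on the grid, which is what makes the map usable for visualizing clusters.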
This paper introduces the use of the Growing Self-Organizing
Map (GSOM) [14], a SOM-based algorithm, in
classifying protein sequences. Unlike the SOM, which has a fixed
structure, the GSOM can grow nodes to better
represent the discovered patterns. With the spread factor (SF)
parameter, the growth or spread of the map can be controlled,
giving an analyst the flexibility to analyze the resulting
clusters at different granularities. The GSOM has proven
effective in pattern discovery of biological and biomedical
data such as leukemia gene expression [15], sleep apnea and
dermatology [16] and DNA sequence fragments [17]. In this
paper, classification of protein sequences was carried out
and the growth characteristics of the GSOM across spread factors
were investigated. The formation of the groups and subgroups