INTERNATIONAL JOURNAL OF COMPUTER APPLICATIONS, VOL. 44, APRIL 2012

A scoring method for the clustering of nucleic acid sequences

Barileé B. Baridam
Department of Computer Science, University of Pretoria, South Africa
Email: bbaridam@cs.up.ac.za

Abstract—The clustering of biological sequence data is a significant task for biologists, because it helps molecular biologists group sequences based on the ancestral traits or hereditary information hidden in them. To accomplish the similarity detection and clustering tasks, several clustering algorithms and similarity and distance measures have been proposed. As observed in the course of this study, most of these algorithms and similarity measures are inefficient at detecting sequences based on their structural similarity. In this paper, the codon-based scoring method (COBASM) is developed to handle this inefficiency. COBASM employs the codon principle, by the application of triplet nucleotides, in the clustering of nucleic acid sequences. The results obtained show that COBASM is able to produce compact and well-separated clusters based on the structural similarity of sequences.

General Terms: Clustering, homology, similarity measure, sequences

Keywords: Codon, scoring method, similarity measure, clustering

1 INTRODUCTION

In computational biology, clustering goes beyond a mere statistical tool for information retrieval. In sequence clustering, it aims at revealing the genetic information of participating sequences. Cluster analysis helps in the determination of gene families and the establishment of implicit links between them. Clustering of biological sequence data presents a great challenge to the computing community as well as to biologists.
This challenge arises from the fact that conventional similarity measures have difficulty detecting structural similarities among sequences, and that sequences cannot be easily clustered with the conventional distance or similarity measures commonly applied to numeric data sets. This is because nucleic acid sequences are represented by symbols rather than by numeric values. Also, the string edit distance algorithms [1] employed in string comparisons and string similarity searches are mostly not suitable for clustering biological sequence data [2], because the structural nature of biological sequences makes string edit distance inappropriate. For example, the edit distance between the strings (sequences) CCCCCCCGGGGGGG and GGGGGGGCCCCCCC indicates that there is no similarity between them, since every position of one string differs from the corresponding position of the other. However, looking at the strings biologically, there is an element of structural similarity which the edit distance neglects. Multiple sequence alignment is employed in most cases to overcome this structural challenge. Because structural similarity is a major issue in biological sequence analysis, it becomes very important to design a similarity measure (scoring method) that takes it into account without introducing sequence alignment.

1.1 The Case of Sequence Homology

Similar biological sequences, whether of nucleotides or amino acids, are often derived from the same ancestral sequence and are therefore expected to share common structure and function even when the sequences are from different organisms. Such highly similar sequences are called homologues. It is believed, for example, that sequences about 100 nucleotides long (or 100 amino acids long for proteins) may be considered homologous if 70 percent of the nucleotides (or 25 percent of the amino acids) are identical [3].
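The identity rule of thumb above can be illustrated with a minimal Python sketch. It is not part of COBASM. It assumes that the two sequences have equal length and are compared position by position, and the names `percent_identity`, `meets_identity_threshold` and `IDENTITY_THRESHOLDS` are hypothetical, introduced only for this illustration. The 70 and 25 percent thresholds are the ones quoted from [3].

```python
# Illustrative sketch only: a position-by-position identity check.
# Assumption: both sequences have the same length and are already
# positioned against each other, so no alignment step is performed.

# Thresholds quoted in the text from [3]; the dictionary name is hypothetical.
IDENTITY_THRESHOLDS = {"nucleotide": 70.0, "protein": 25.0}


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Return the percentage of positions at which the two sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def meets_identity_threshold(seq_a: str, seq_b: str, kind: str = "nucleotide") -> bool:
    """Check the identity percentage against the threshold for the sequence kind."""
    return percent_identity(seq_a, seq_b) >= IDENTITY_THRESHOLDS[kind]


if __name__ == "__main__":
    # Three of the four positions agree.
    print(percent_identity("ACGT", "ACGA"))
    # The two strings from the Introduction agree at no position.
    print(percent_identity("CCCCCCCGGGGGGG", "GGGGGGGCCCCCCC"))
```

Applied to the two strings from the Introduction, this position-wise check reports no identity at all, which is the same blind spot that motivates a scoring method sensitive to structural similarity.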
Where the percentage falls below 70 (or 25, as the case may be), the twilight zone is believed to be reached: there the meaning of the observed similarity is doubtful, and neither homology (similarity due to common evolutionary history) nor non-homology can be guaranteed. The homology concept is utilized in the codon-based scoring method, COBASM.¹

¹ COBASM is referred to interchangeably as a scoring method or a similarity measure.

2 CLUSTERING AND SIMILARITY SEARCH PROBLEMS

The successful application by Eisen et al. [4] of the average linkage hierarchical clustering algorithm to expression data from the budding yeast Saccharomyces cerevisiae and from the reaction of human fibroblasts to serum heralded the application of cluster analysis in the grouping of functionally similar genes [5]. In particular, hierarchical clustering has been used to organize genes into a hierarchical