Pattern Recognition 39 (2006) 2344–2355
www.elsevier.com/locate/patcog

Efficient bottom-up hybrid hierarchical clustering techniques for protein sequence classification

P.A. Vijaya*, M. Narasimha Murty, D.K. Subramanian
Department of Computer Science and Automation, Indian Institute of Science, Bangalore 560012, India
Received 5 July 2005; received in revised form 24 October 2005; accepted 2 December 2005

Abstract
Hybrid hierarchical clustering techniques, which combine the characteristics of different partitional clustering techniques or of partitional and hierarchical clustering techniques, are of interest. In this paper, efficient bottom-up hybrid hierarchical clustering (BHHC) techniques are proposed for prototype selection in protein sequence classification. In the first stage, an incremental partitional clustering technique such as the leader algorithm (ordered leader no update (OLNU) method), which requires only one database (db) scan, is used to find a set of subcluster representatives. In the second stage, either a hierarchical agglomerative clustering (HAC) scheme or a partitional clustering algorithm, 'K-medians', is applied to these subcluster representatives to obtain the required number of clusters. This hybrid scheme is therefore scalable and suitable for clustering large data sets. It also yields a hierarchical structure of clusters and subclusters, whose representatives are used for pattern classification. Even if a larger number of prototypes is generated, classification time does not increase much, as only a part of the hierarchical structure is searched. The experimental results (classification accuracy (CA) using the prototypes obtained, and the computation time) of the proposed algorithms are compared with those of the hierarchical agglomerative schemes, K-medians and nearest neighbour classifier (NNC) methods. The proposed methods are found to be computationally efficient with reasonably good CA.
© 2006 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.

Keywords: Hybrid clustering; Hierarchical structure; Protein sequences; Median strings/sequences; Prototypes; Feature selection; Classification accuracy

* Corresponding author. Tel.: +91 80 2293 2368 113; fax: +91 80 2360 2911.
E-mail addresses: pav@csa.iisc.ernet.in (P.A. Vijaya), mnm@csa.iisc.ernet.in (M. Narasimha Murty), dks@csa.iisc.ernet.in (D.K. Subramanian).
0031-3203/$30.00
doi:10.1016/j.patcog.2005.12.001

1. Introduction

Clustering is an active research topic in pattern recognition, data mining, statistics and machine learning, with diverse emphasis. We use clustering as a tool for prototype selection for pattern classification. It is applicable to both labelled and unlabelled data sets, as the labels are not used while clustering the patterns based on distance/similarity measures. The earlier clustering approaches do not adequately consider the fact that the data set can be too large and may not fit in the main memory of some computers. It is necessary to examine the principle of clustering to devise efficient algorithms that minimize the I/O operations and space requirements and yield appropriate prototypes/abstractions to increase the classification accuracy (CA). One application area where efficient clustering techniques are required is bioinformatics. In this paper, we are interested in designing hybrid hierarchical clustering techniques for pattern classification that are scalable and also suitable for protein sequences encountered in bioinformatics.

This paper is organized as follows. In Section 2, the significance of protein sequence clustering and the protein sequence alignment procedure are described in brief.
Section 3 discusses the application of clustering techniques to protein sequences and methods of improving the scalability of a technique for clustering large data sets. Section 4 contains the details of the proposed method. Experimental results and discussions are presented in Section 5. Conclusions and further research scope are provided in Section 6.
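
As a rough illustration of the first stage described in the abstract, the following is a minimal sketch of a generic single-scan leader algorithm. It is not the OLNU variant as specified in this paper: the function names, the use of plain edit distance in place of an alignment-based score, the threshold value, and the first-match assignment rule are all assumptions made for illustration. The second stage (HAC or K-medians applied to the leaders) is not shown.

```python
def edit_distance(a, b):
    """Levenshtein distance; an assumed stand-in for an alignment-based score."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def leader_clustering(patterns, threshold, dist=edit_distance):
    """Single pass over the data: each pattern joins the first leader within
    `threshold`, otherwise it becomes a new leader (subcluster representative)."""
    leaders = []   # one representative per subcluster
    members = []   # members[k] holds the patterns assigned to leaders[k]
    for p in patterns:
        for k, leader in enumerate(leaders):
            if dist(p, leader) <= threshold:
                members[k].append(p)
                break
        else:
            # No existing leader is close enough: start a new subcluster.
            leaders.append(p)
            members.append([p])
    return leaders, members


# Toy sequences (hypothetical, not from the paper's data sets).
toy = ["AAAA", "AAAT", "GGGG", "GGGC", "AATA"]
leaders, members = leader_clustering(toy, threshold=1)
```

Because each pattern is compared only with the current leaders, the data set is read once; the result depends on the order in which patterns are presented and on the chosen threshold.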