Pattern Recognition 39 (2006) 2344–2355
www.elsevier.com/locate/patcog

Efficient bottom-up hybrid hierarchical clustering techniques for protein sequence classification

P.A. Vijaya ∗, M. Narasimha Murty, D.K. Subramanian

Department of Computer Science and Automation, Indian Institute of Science, Bangalore 560012, India

Received 5 July 2005; received in revised form 24 October 2005; accepted 2 December 2005

Abstract

Hybrid hierarchical clustering techniques, which combine the characteristics of different partitional clustering techniques or of partitional and hierarchical clustering techniques, are of interest. In this paper, efficient bottom-up hybrid hierarchical clustering (BHHC) techniques are proposed for prototype selection in protein sequence classification. In the first stage, an incremental partitional clustering technique such as the leader algorithm (ordered leader no update (OLNU) method), which requires only one database (db) scan, is used to find a set of subcluster representatives. In the second stage, either a hierarchical agglomerative clustering (HAC) scheme or a partitional clustering algorithm, 'K-medians', is applied to these subcluster representatives to obtain the required number of clusters. The hybrid scheme is therefore scalable and suitable for clustering large data sets. It also yields a hierarchical structure of clusters and subclusters, whose representatives are used for pattern classification. Even if a larger number of prototypes is generated, classification time does not increase much, as only a part of the hierarchical structure is searched. The experimental results of the proposed algorithms (classification accuracy (CA) using the prototypes obtained, and computation time) are compared with those of the hierarchical agglomerative schemes, K-medians and the nearest neighbour classifier (NNC). The proposed methods are found to be computationally efficient with reasonably good CA.
© 2006 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.

Keywords: Hybrid clustering; Hierarchical structure; Protein sequences; Median strings/sequences; Prototypes; Feature selection; Classification accuracy

∗ Corresponding author. Tel.: +91 80 2293 2368 113; fax: +91 80 2360 2911.
E-mail addresses: pav@csa.iisc.ernet.in (P.A. Vijaya), mnm@csa.iisc.ernet.in (M. Narasimha Murty), dks@csa.iisc.ernet.in (D.K. Subramanian).
0031-3203/$30.00; doi:10.1016/j.patcog.2005.12.001

1. Introduction

Clustering is an active research topic in pattern recognition, data mining, statistics and machine learning, with diverse emphases. We use clustering as a tool for prototype selection for pattern classification. It is applicable to both labelled and unlabelled data sets, as the labels are not used while clustering the patterns based on distance/similarity measures. Earlier clustering approaches do not adequately consider the fact that the data set can be too large to fit in the main memory of some computers. It is necessary to examine the principle of clustering in order to devise efficient algorithms that minimize I/O operations and space requirements, and to obtain appropriate prototypes/abstractions that increase the classification accuracy (CA). One application area where efficient clustering techniques are required is bioinformatics. In this paper, we are interested in designing hybrid hierarchical clustering techniques for pattern classification that are scalable and also suitable for the protein sequences encountered in bioinformatics.

This paper is organized as follows. In Section 2, the significance of protein sequence clustering and the protein sequence alignment procedure are described in brief.
Section 3 discusses the application of clustering techniques to protein sequences and methods of improving the scalability of a technique for clustering large data sets. Section 4 contains the details of the proposed method. Experimental results and discussions are presented in Section 5. Conclusions and the scope for further research are provided in Section 6.
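
As an informal illustration of the two-stage idea summarized in the abstract, the following minimal sketch first runs a one-scan leader pass and then agglomerates the resulting leaders. It is not the implementation evaluated in this paper. Several choices are assumptions made only for illustration: a plain Levenshtein edit distance stands in for an alignment-based dissimilarity, single-link merging stands in for the HAC variants, and the names `edit_distance`, `leader_clustering`, `agglomerate` and `threshold` are hypothetical.

```python
from typing import Callable, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (assumed stand-in for an alignment-based measure)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def leader_clustering(
    patterns: List[str],
    threshold: int,
    dist: Callable[[str, str], int] = edit_distance,
) -> Tuple[List[str], List[List[str]]]:
    """Stage 1: a single scan over the data in its given order.

    Each pattern joins the first existing leader within `threshold`;
    otherwise it becomes a new leader. Leaders are never updated.
    """
    leaders: List[str] = []
    members: List[List[str]] = []
    for p in patterns:
        for k, leader in enumerate(leaders):
            if dist(p, leader) <= threshold:
                members[k].append(p)
                break
        else:  # no leader was close enough
            leaders.append(p)
            members.append([p])
    return leaders, members


def agglomerate(
    leaders: List[str],
    k: int,
    dist: Callable[[str, str], int] = edit_distance,
) -> List[List[str]]:
    """Stage 2 (assumed single-link): merge leader groups until k remain."""
    clusters = [[x] for x in leaders]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters


# Toy sequences, invented for illustration only.
toy = ["AAAA", "AAAT", "GGGG", "GGGC", "AATT", "CCCC"]
toy_leaders, toy_members = leader_clustering(toy, threshold=1)
toy_clusters = agglomerate(toy_leaders, k=3)
```

Because the second stage operates only on the leaders rather than on all patterns, its pairwise-distance cost depends on the number of subclusters found in the single scan, which is the source of the scalability argued for above.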