Performance Evaluation of Protein Sequence Clustering Tools

Haifeng Liu and Loo-Nin Teow

DSO National Laboratories (Kent Ridge), 27 Medical Drive, Singapore 117510
lhaifeng@dso.org.sg

Abstract. This paper evaluates the clustering quality of various protein clustering tools that are publicly available as standalone applications. We first review current protein sequence clustering methods and introduce a new incremental clustering tool, denoted PINC. We then propose an intuitive performance metric for evaluating the tools and report their evaluation results on the public database Pfam.

1 Introduction

With the enormous growth of public sequence databases, grouping sequences into protein families is becoming increasingly important, as it provides evolutionary, functional and structural information about the sequences. Protein families can be defined as groups of molecules that share significant sequence similarity ([7]). Well-characterized proteins within a family can hence allow one to reliably assign functions to family members whose functions are unknown or not well understood. Many methods have been developed for clustering protein sequences according to similarity (or distance) information. This approach usually groups homologous proteins together via a similarity measure obtained by sequence comparison. However, correctly clustering a large number of protein sequences with this approach remains a huge challenge because of the following inherent characteristics of proteins:

– Multi-domain proteins. Proteins containing multiple domains that may be similar to different sets of unrelated clusters can confound protein family detection methods and result in the incorrect grouping of proteins into families.
– Promiscuous domains. Promiscuous domains are widespread components of many proteins.
Proteins assigned to a protein family purely on the basis of a promiscuous domain probably have very different functions from other members of that family.
– Fragmented proteins. Public databases may contain fragmented protein sequences (peptides). This incomplete information may lead to the incorrect assignment of a protein to a family.
– Distantly related proteins. Certain proteins sharing little sequence identity can have the same function. It is hard to group such distantly related proteins together without comparing their three-dimensional structures.

V.S. Sunderam et al. (Eds.): ICCS 2005, LNCS 3515, pp. 877–885, 2005. © Springer-Verlag Berlin Heidelberg 2005
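
The multi-domain problem listed in the Introduction can be illustrated with a minimal Python sketch. It is a toy model, not the method of any tool evaluated in this paper. It makes several assumptions: each protein is reduced to a set of domain labels, two proteins count as "similar" when they share at least one domain, and clusters are formed by single linkage (any chain of similar pairs is merged). The protein names, the domain labels and the function `single_linkage` are hypothetical.

```python
from itertools import combinations

# Hypothetical toy data: each protein is described only by the domains it contains.
proteins = {
    "kinase_1": {"kinase"},
    "kinase_2": {"kinase"},
    "sh3_1": {"SH3"},
    "sh3_2": {"SH3"},
}


def single_linkage(domains_by_protein):
    """Cluster proteins that are connected by any chain of shared domains."""
    parent = {name: name for name in domains_by_protein}

    def find(x):
        # Follow parent links to the cluster representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Assumed similarity rule: two proteins are linked if they share a domain.
    for a, b in combinations(domains_by_protein, 2):
        if domains_by_protein[a] & domains_by_protein[b]:
            parent[find(a)] = find(b)

    clusters = {}
    for name in domains_by_protein:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(sorted(members) for members in clusters.values())


# Without a multi-domain protein, the two families stay separate.
without_bridge = single_linkage(proteins)

# Adding one protein that carries both domains chains the families together.
with_bridge = single_linkage({**proteins, "fusion": {"kinase", "SH3"}})

print(without_bridge)
print(with_bridge)
```

Under these assumptions, the four single-domain proteins form two clusters, while adding the single two-domain protein collapses everything into one cluster. This is the kind of incorrect grouping that a multi-domain protein can cause when similarity is treated as transitive.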