A knowledge-driven approach to cluster validity assessment

Nadia Bolshakova a,*, Francisco Azuaje b and Pádraig Cunningham a

a Department of Computer Science, Trinity College Dublin, Dublin 2, Ireland
b School of Computing and Mathematics, University of Ulster, Jordanstown, BT37 0QB, U.K.
* To whom correspondence should be addressed

ABSTRACT
Summary: This paper presents an approach to assessing cluster validity based on similarity knowledge extracted from the Gene Ontology.
Availability: The program is freely available for non-profit use on request from the authors.
Contact: Nadia.Bolshakova@cs.tcd.ie
Supplementary information: http://www.cs.tcd.ie/Nadia.Bolshakova/GOtool.html

The automated integration of background knowledge is fundamental to supporting the generation and validation of hypotheses about the function of gene products. One such source of prior knowledge is the Gene Ontology (GO), a structured, shared vocabulary that allows the annotation of gene products across different model organisms. The GO comprises three independent hierarchies: molecular function (MF), biological process (BP) and cellular component (CC). Researchers can represent relationships between gene products and annotation terms in these hierarchies.

Previous research has applied GO information to detect overrepresented functional annotations in clusters of genes obtained from expression analyses. GO information has also been suggested as a means to assess gene sequence similarity and expression correlation. For additional information on the GO and its applications, the reader is referred to its website (http://www.geneontology.org) and to Wang et al. (2004).

Topological and statistical information extracted from the GO in relation to a set of annotated gene products may be used to measure similarity between them. Different GO-driven similarity assessment methods may then be implemented to perform clustering or to quantify the quality of the resulting clusters.
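As an illustration of how topological information from a GO-like hierarchy could yield a similarity score, the following minimal Python sketch computes a Wu-Palmer-style depth-based similarity between terms and averages it over the terms annotating two gene products. This is a sketch under stated assumptions, not the authors' tool: the hierarchy, the term names, the single-parent simplification (the real GO is a directed acyclic graph in which a term may have several parents) and the averaging rule are all hypothetical choices made for illustration.

```python
# Toy is-a hierarchy (child -> parent). Term names are hypothetical, not real
# GO identifiers, and each term has one parent, unlike the real GO DAG.
PARENT = {
    "root": None,
    "binding": "root",
    "catalysis": "root",
    "dna_binding": "binding",
    "rna_binding": "binding",
    "kinase": "catalysis",
}


def ancestors(term):
    """Return the path from `term` up to the root, starting with `term`."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


def depth(term):
    """Number of nodes on the path from the root to `term` (root has depth 1)."""
    return len(ancestors(term))


def lowest_common_ancestor(a, b):
    """Deepest term that is an ancestor of (or equal to) both `a` and `b`."""
    seen = set(ancestors(a))
    for term in ancestors(b):  # walks upward, so the first hit is the deepest
        if term in seen:
            return term
    raise ValueError("terms do not share a root")


def term_similarity(a, b):
    """Wu-Palmer-style similarity: 2 * depth(lca) / (depth(a) + depth(b))."""
    lca = lowest_common_ancestor(a, b)
    return 2 * depth(lca) / (depth(a) + depth(b))


def gene_similarity(terms_a, terms_b):
    """Assumed aggregation: mean similarity over all pairs of annotation terms."""
    pairs = [(a, b) for a in terms_a for b in terms_b]
    return sum(term_similarity(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    print(term_similarity("dna_binding", "rna_binding"))
    print(term_similarity("dna_binding", "kinase"))
    print(gene_similarity(["dna_binding"], ["rna_binding", "kinase"]))
```

In this toy hierarchy, two terms that share a specific parent score higher than two terms whose only common ancestor is the root, which is the behaviour a topology-based measure is meant to capture.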
Cluster validity assessment may consist of data- and knowledge-driven methods, which aim to estimate the optimal cluster partition from a collection of candidate partitions. Data-driven methods mainly include statistical tests or validity indices applied to the clustered data. Knowledge-driven methods are proposed to enhance the predictive reliability and biological relevance of the results. A data-driven cluster validity assessment platform was previously reported by Bolshakova and Azuaje (2003).

Traditional GO-based cluster description methods have consisted of statistical analyses of the enrichment of GO terms in a cluster. The application of GO-based similarity to perform clustering and validate clustering outcomes has not been widely investigated. A recent contribution by Speer et al. (2004) presented an algorithm that incorporates GO annotations to cluster genes. They applied the Davies-Bouldin index (Bolshakova and Azuaje, 2003) to estimate the quality of the clusters.

© The Author (2005). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oupjournals.org
Bioinformatics Advance Access published February 15, 2005
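For readers unfamiliar with the Davies-Bouldin index mentioned above, the following minimal Python sketch shows the standard data-driven form of the index: for each cluster, take the worst-case ratio of summed within-cluster scatter to between-centroid distance, then average over clusters, so that lower values indicate more compact and better separated partitions. The Euclidean distance, the function names and the toy points are assumptions made for illustration; the source does not specify which distance Speer et al. (2004) used.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dims))


def scatter(points, center):
    """Average Euclidean distance from the cluster's points to its centroid."""
    return sum(math.dist(p, center) for p in points) / len(points)


def davies_bouldin(clusters):
    """Davies-Bouldin index of a partition given as a list of point lists.

    Requires at least two clusters with distinct centroids.
    """
    centers = [centroid(c) for c in clusters]
    scatters = [scatter(c, m) for c, m in zip(clusters, centers)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case similarity of cluster i to any other cluster j.
        total += max(
            (scatters[i] + scatters[j]) / math.dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
    return total / k


if __name__ == "__main__":
    # Hypothetical 2-D data: the same four points partitioned two ways.
    compact = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
    mixed = [[(0, 0), (10, 0)], [(0, 1), (10, 1)]]
    print(davies_bouldin(compact))
    print(davies_bouldin(mixed))
```

Comparing candidate partitions of the same data by this value, and preferring the one with the lowest index, is one way a data-driven method can estimate the optimal partition.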