A knowledge-driven approach to cluster validity assessment
Nadia Bolshakova (a,*), Francisco Azuaje (b) and Pádraig Cunningham (a)
(a) Department of Computer Science, Trinity College Dublin, Dublin 2, Ireland
(b) School of Computing and Mathematics, University of Ulster, Jordanstown, BT37 0QB, U.K.
(*) To whom correspondence should be addressed
ABSTRACT
Summary: This paper presents an approach to assessing cluster validity based on
similarity knowledge extracted from the Gene Ontology.
Availability: The program is freely available for non-profit use on request from the
authors.
Contact: Nadia.Bolshakova@cs.tcd.ie
Supplementary information: http://www.cs.tcd.ie/Nadia.Bolshakova/GOtool.html
The automated integration of background knowledge is fundamental to support the generation
and validation of hypotheses about the function of gene products. One such source of prior
knowledge is the Gene Ontology (GO), which is a structured, shared vocabulary that allows the
annotation of gene products across different model organisms. The GO comprises three independent
hierarchies: molecular function (MF), biological process (BP) and cellular component (CC).
Researchers can represent relationships between gene products and annotation terms in these
hierarchies. Previous research has applied GO information to detect overrepresented functional
annotations in clusters of genes obtained from expression analyses. GO information has also been
suggested as a means of assessing gene sequence similarity and expression correlation. For
additional information on the GO and its applications, the reader is referred to its website
(http://www.geneontology.org) and to Wang et al. (2004).
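The detection of overrepresented annotations mentioned above is commonly framed as a hypergeometric tail test. The following is a minimal sketch under that assumption; the text does not state which statistical test the cited studies used, and the function name and parameters are hypothetical.

```python
from math import comb

def enrichment_p_value(population, annotated, cluster_size, cluster_annotated):
    """Upper-tail hypergeometric probability (an assumed choice of test).

    population        -- number of genes in the whole data set
    annotated         -- genes in the data set annotated to the GO term
    cluster_size      -- number of genes in the cluster
    cluster_annotated -- genes in the cluster annotated to the GO term

    Returns P(X >= cluster_annotated) when cluster_size genes are drawn
    at random, without replacement, from the population.
    """
    upper = min(annotated, cluster_size)
    total = comb(population, cluster_size)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, cluster_size - k)
        for k in range(cluster_annotated, upper + 1)
    )
    return tail / total

# Toy numbers: 10 genes, 3 carry the term, and a cluster of 3 genes
# contains all of them. Only one of the C(10, 3) = 120 possible clusters
# does so, hence the probability is 1/120.
p = enrichment_p_value(10, 3, 3, 3)
```

A small value suggests that the term occurs in the cluster more often than expected by chance. In practice many terms are tested at once, so some correction for multiple testing would normally be applied.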
Topological and statistical information extracted from the GO in relation to a set of annotated
gene products may be used to measure similarity between them. Different GO-driven similarity
assessment methods may then be implemented to perform clustering or to quantify the quality of the
resulting clusters. Cluster validity assessment may consist of data- and knowledge-driven methods,
which aim to estimate the optimal cluster partition from a collection of candidate partitions. Data-
driven methods mainly include statistical tests or validity indices applied to the clustered data.
Knowledge-driven methods are proposed to enhance the predictive reliability and biological
relevance of the results. A data-driven cluster validity assessment platform was previously reported
by Bolshakova and Azuaje (2003).
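To illustrate how topological information from the GO might be turned into a similarity score between gene products, the sketch below uses a Wu-Palmer-style depth-based measure on a tiny hand-made hierarchy. Both the measure and the toy terms are assumptions made here for illustration; the paragraph above does not specify a particular measure. The real GO is a directed acyclic graph in which a term may have several parents, which this tree-shaped example ignores.

```python
# Hypothetical toy hierarchy: child term -> parent term.
# A tree is assumed for simplicity; the real GO is a DAG.
PARENT = {
    "binding": "molecular_function",
    "catalytic": "molecular_function",
    "dna_binding": "binding",
    "rna_binding": "binding",
    "kinase": "catalytic",
}

def path_to_root(term):
    """Return the terms from `term` up to the root, both inclusive."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def depth(term):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(term))

def term_similarity(a, b):
    """Wu-Palmer-style score: 2 * depth(lca) / (depth(a) + depth(b))."""
    ancestors_of_b = set(path_to_root(b))
    # Walking upwards from `a`, the first term shared with `b` is the
    # lowest common ancestor (lca) in a tree.
    lca = next(t for t in path_to_root(a) if t in ancestors_of_b)
    return 2 * depth(lca) / (depth(a) + depth(b))

def gene_similarity(terms_a, terms_b):
    """Average term similarity over all annotation pairs of two gene products."""
    scores = [term_similarity(a, b) for a in terms_a for b in terms_b]
    return sum(scores) / len(scores)

# Two gene products with hypothetical annotations.
s = gene_similarity(["dna_binding"], ["rna_binding", "kinase"])
```

Here the two binding terms share a deep common ancestor and score 2/3, whereas `dna_binding` and `kinase` meet only at the root and score 1/3, so the gene-level average is 0.5. A matrix of such scores could then be passed to a clustering or validity procedure in place of expression-based distances.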
Traditional GO-based cluster description methods have consisted of statistical analyses of the
enrichment of GO terms in a cluster. The application of GO-based similarity to perform clustering
and validate clustering outcomes has not been widely investigated. A recent contribution by Speer
et al. (2004) presented an algorithm that incorporates GO annotations to cluster genes. They applied
the Davies-Bouldin index (Bolshakova and Azuaje, 2003) to estimate the quality of the clusters.
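For readers unfamiliar with the Davies-Bouldin index, the sketch below computes its common centroid-based form: for each cluster, the worst ratio of summed within-cluster scatter to between-centroid distance, averaged over clusters. The use of Euclidean distance and centroids is an assumption made here; the text does not state which variant of the index was applied, and the function names are hypothetical.

```python
from math import dist

def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def davies_bouldin(clusters):
    """Centroid-based Davies-Bouldin index for two or more clusters.

    `clusters` is a list of clusters, each a list of coordinate tuples.
    Lower values indicate compact, well-separated clusters.
    """
    centroids = [centroid(c) for c in clusters]
    # Scatter: mean Euclidean distance of the members to their centroid.
    scatter = [
        sum(dist(p, m) for p in c) / len(c)
        for c, m in zip(clusters, centroids)
    ]
    k = len(clusters)
    worst = [
        max(
            (scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
            for j in range(k)
            if j != i
        )
        for i in range(k)
    ]
    return sum(worst) / k

# Two toy partitions with the same centroids but different compactness.
tight = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
loose = [[(0, 0), (0, 8)], [(10, 0), (10, 8)]]
db_tight = davies_bouldin(tight)
db_loose = davies_bouldin(loose)
```

In the toy data the centroids are 10 units apart in both partitions; the tight partition has scatter 1 per cluster and an index of 0.2, while the loose one has scatter 4 and an index of 0.8. When several candidate partitions are compared, the one with the lowest index would be preferred.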
© The Author (2005). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oupjournals.org
Bioinformatics Advance Access published February 15, 2005