Semantic Subgroup Discovery:
Using Ontologies in Microarray Data Analysis
Nada Lavrač, Petra Kralj Novak, Igor Mozetič, Vid Podpečan, Helena Motaln, Marko Petek, Kristina Gruden
Abstract— A major challenge for next generation data mining
systems is creative knowledge discovery from highly diverse and
distributed data and knowledge sources. This paper presents
an approach to information fusion and creative knowledge
discovery from semantically annotated knowledge sources: by
using ontology information as background knowledge for se-
mantic subgroup discovery, rules are constructed that allow the
expert to recognize gene groups that are differentially expressed
in different types of tissues. The paper also presents current
directions in creative knowledge discovery through bisociative
data analysis, illustrated on a systems biology case study.
I. INTRODUCTION
Biologists collect large quantities of data from wet lab
experiments and high-throughput platforms. Public biological
databases, such as Gene Ontology (GO), the Kyoto Encyclopedia
of Genes and Genomes (KEGG) and ENTREZ, are some of the
sources of biological knowledge. Since the growing amounts
of available knowledge and data exceed human analytical
capabilities, technologies that help analyze and extract
useful information from such vast amounts of data need to
be developed and used.
This paper presents an approach to information fusion
and semantic subgroup discovery, by using ontologies as
background knowledge in microarray data analysis. Let us
first explain the basic notions: information fusion, subgroup
discovery, semantic subgroup discovery and bisociative reasoning,
which is at the heart of creative, accidental discovery
(serendipity).
Information Fusion: Information fusion can be defined
as the study of efficient methods for automatically or
semi-automatically transforming information from different
sources and different points in time into a representation that
provides effective support for human and automated decision
making [1]. Recent investigations in using information fusion
to support scientific decision making within bioinformatics
include [2], [7]. Smirnov et al. [10] exploit the idea of formu-
lating an ontology-based model of the problem to be solved
by the user and interpreting it as a constraint satisfaction
problem taking into account information from a dynamic
environment. An approach to the integration of the biological
databases GO, KEGG and ENTREZ is implemented in the
SEGS information fusion engine (Searching for Enriched
Gene Sets) [13]. Another, much larger, integrated annotated
bioinformatics information resource is Biomine [9]. The
latter two approaches are used for information fusion in the
methodology presented in this paper.
Nada Lavrač, Petra Kralj Novak, Igor Mozetič and Vid Podpečan
are with Jožef Stefan Institute, Jamova 39, Ljubljana, Slovenia.
Nada Lavrač is also with University of Nova Gorica, Vipavska 13,
5000 Nova Gorica, Slovenia, {nada.lavrac, petra.kralj,
igor.mozetic, vid.podpecan}@ijs.si. Helena Motaln,
Marko Petek and Kristina Gruden are with National Institute of
Biology, Večna pot 111, Ljubljana, Slovenia, {helena.motaln,
marko.petek, kristina.gruden}@nib.si.
Subgroup Discovery: Subgroup discovery techniques are
used to generate explicit knowledge in the form of rules
that allow the user to recognize important relationships
in a set of class labeled training instances, describing the
target property of interest. Consider two applications. In the
first one, the induced subgroup-describing rules suggest to the
general practitioner how to select individuals for population
screening for high risk of coronary heart disease
(CHD) [3]. The rule below describes a group of overweight
female patients older than 63 years:
High CHD Risk ← sex = female & age > 63 years &
body mass index > 25 kg m^-2
In the second application [4], subgroup-describing rules suggest
genes that are characteristic of a given cancer type (e.g.,
leukemia) in a problem of distinguishing among
14 different cancer types (leukemia, CNS, lung cancer, etc.):
Leukemia ← KIAA0128 is diff expressed &
prostaglandin d2 synthase is not diff expressed
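To make the notion of a subgroup-describing rule concrete, the following minimal sketch applies the condition part of the CHD rule above to a handful of records and reports how many records the rule covers and what share of the covered records carry the target label. The `Patient` type, the function names, the toy records and the choice of coverage and precision as summary measures are all assumptions made for this illustration; they are not taken from [3].

```python
from dataclasses import dataclass


@dataclass
class Patient:
    sex: str
    age: int             # years
    bmi: float           # body mass index, kg m^-2
    high_chd_risk: bool  # class label (target property)


def rule_covers(p: Patient) -> bool:
    """Condition part of the rule: sex = female & age > 63 & BMI > 25."""
    return p.sex == "female" and p.age > 63 and p.bmi > 25


def subgroup_stats(patients):
    """Return (coverage, precision) of the rule on the given records."""
    covered = [p for p in patients if rule_covers(p)]
    positives = [p for p in covered if p.high_chd_risk]
    coverage = len(covered) / len(patients)
    precision = len(positives) / len(covered) if covered else 0.0
    return coverage, precision


# Invented toy records, for illustration only (not real patient data).
PATIENTS = [
    Patient("female", 70, 28.0, True),
    Patient("female", 66, 31.5, True),
    Patient("female", 68, 26.0, False),
    Patient("female", 50, 29.0, False),
    Patient("male", 72, 30.0, True),
    Patient("male", 45, 22.0, False),
    Patient("female", 75, 23.0, False),
    Patient("male", 60, 27.0, False),
]

coverage, precision = subgroup_stats(PATIENTS)
print(f"coverage={coverage:.3f} precision={precision:.3f}")
```

On these toy records the rule covers three of the eight patients, two of whom are labeled high risk. A subgroup discovery algorithm would search over many such candidate conditions and keep those that score well on a chosen quality measure.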
Semantic Subgroup Discovery: Semantic subgroup dis-
covery refers to subgroup discovery, where semantically
annotated knowledge sources (ontologies) are used as back-
ground knowledge in the data mining process. Using the
technology of relational subgroup discovery (RSD) [14],
we have developed an approach to information fusion and
semantic data mining, enabling background knowledge in the
form of ontologies to be used in relational machine learning.
The relational subgroup discovery approach, which was
successfully adapted and applied to mining of bioinformatics
data [12], generates descriptive rules as conjunctions of
ontology terms from the GO, KEGG and ENTREZ ontolo-
gies. For instance, an induced description of geneGroup(A)
discovered by RSD for the CNS (central nervous system)
cancer class in a problem of distinguishing between 14
cancer types, characterizes the group of genes A differentially
expressed in CNS as a conjunction of two relational features:
f_i(A) = interaction(A,B) & process(B,'phosphorylation') and
f_k(A) = interaction(A,B) & process(B,'negative regulation of
apoptosis') & component(B,'intracellular membrane-bound
organelle').
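The two relational features can be read as existential queries over background knowledge: a gene A satisfies a feature if it interacts with some gene B that carries the listed annotations. The sketch below evaluates both features on a small hand-made knowledge base. The gene names, the annotations, the dictionary layout and the assumption that B is bound separately within each feature are all illustrative and are not taken from RSD or from the GO, KEGG and ENTREZ data.

```python
# Toy background knowledge; all gene names and annotations are invented.
INTERACTS = {
    "geneA1": {"geneB1", "geneB2"},
    "geneA2": {"geneB3"},
    "geneA3": {"geneB2"},
}
PROCESS = {
    "geneB1": {"phosphorylation"},
    "geneB2": {"negative regulation of apoptosis"},
    "geneB3": {"phosphorylation", "negative regulation of apoptosis"},
}
COMPONENT = {
    "geneB2": {"intracellular membrane-bound organelle"},
    "geneB3": {"cytoplasm"},
}


def f_i(a):
    """interaction(A,B) & process(B,'phosphorylation') for some B."""
    return any(
        "phosphorylation" in PROCESS.get(b, set())
        for b in INTERACTS.get(a, set())
    )


def f_k(a):
    """interaction(A,B) & process(B,'negative regulation of apoptosis')
    & component(B,'intracellular membrane-bound organelle') for some B."""
    return any(
        "negative regulation of apoptosis" in PROCESS.get(b, set())
        and "intracellular membrane-bound organelle" in COMPONENT.get(b, set())
        for b in INTERACTS.get(a, set())
    )


def gene_group(genes):
    """Genes satisfying the conjunction of both features."""
    return sorted(a for a in genes if f_i(a) and f_k(a))


GROUP = gene_group(INTERACTS)
print(GROUP)
```

In this toy knowledge base only `geneA1` satisfies both features: `geneA2` fails the second feature because its interaction partner is annotated with a different cellular component, and `geneA3` fails the first because its partner is not annotated with phosphorylation.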
The RSD semantic subgroup discovery approach was fur-
ther refined in the SEGS algorithm (Searching for Enriched
Gene Sets) [13], which is used in the information fusion and
31st Annual International Conference of the IEEE EMBS
Minneapolis, Minnesota, USA, September 2-6, 2009
978-1-4244-3296-7/09/$25.00 ©2009 IEEE