Semantic Subgroup Discovery: Using Ontologies in Microarray Data Analysis

Nada Lavrač, Petra Kralj Novak, Igor Mozetič, Vid Podpečan, Helena Motaln, Marko Petek, Kristina Gruden

Abstract: A major challenge for next generation data mining systems is creative knowledge discovery from highly diverse and distributed data and knowledge sources. This paper presents an approach to information fusion and creative knowledge discovery from semantically annotated knowledge sources: by using ontology information as background knowledge for semantic subgroup discovery, rules are constructed that allow the expert to recognize gene groups that are differentially expressed in different types of tissues. The paper also presents current directions in creative knowledge discovery through bisociative data analysis, illustrated on a systems biology case study.

I. INTRODUCTION

Biologists collect large quantities of data from wet lab experiments and high-throughput platforms. Public biological databases, such as Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) and ENTREZ, are some of the sources of biological knowledge. Since the growing amounts of available knowledge and data exceed human analytical capabilities, technologies that help analyze and extract useful information from such vast amounts of data need to be developed and used.

This paper presents an approach to information fusion and semantic subgroup discovery that uses ontologies as background knowledge in microarray data analysis. Let us first explain the basic notions: information fusion, subgroup discovery, semantic subgroup discovery and bisociative reasoning, which is at the heart of creative, accidental discovery (serendipity).
Nada Lavrač, Petra Kralj Novak, Igor Mozetič and Vid Podpečan are with Jožef Stefan Institute, Jamova 39, Ljubljana, Slovenia. Nada Lavrač is also with University of Nova Gorica, Vipavska 13, 5000 Nova Gorica, Slovenia, {nada.lavrac, petra.kralj, igor.mozetic, vid.podpecan}@ijs.si. Helena Motaln, Marko Petek and Kristina Gruden are with National Institute of Biology, Večna pot 111, Ljubljana, Slovenia, {helena.motaln, marko.petek, kristina.gruden}@nib.si.

Information Fusion: Information fusion can be defined as a study of efficient methods for automatically or semi-automatically transforming information from different sources and different points in time into a representation that provides effective support for human and automated decision making [1]. Recent investigations into using information fusion to support scientific decision making within bioinformatics include [2], [7]. Smirnov et al. [10] exploit the idea of formulating an ontology-based model of the problem to be solved by the user and interpreting it as a constraint satisfaction problem that takes into account information from a dynamic environment. An approach to the integration of the biological databases GO, KEGG and ENTREZ is implemented in the SEGS information fusion engine (Searching for Enriched Gene Sets) [13]. Another, much larger, integrated annotated bioinformatics information resource is Biomine [9]. These two approaches, SEGS and Biomine, are used for information fusion in the methodology presented in this paper.

Subgroup Discovery: Subgroup discovery techniques are used to generate explicit knowledge in the form of rules that allow the user to recognize important relationships in a set of class-labeled training instances describing the target property of interest. Consider two applications. In the first one, the induced subgroup-describing rules suggest to the general practitioner how to select individuals for population screening for high risk of coronary heart disease (CHD) [3].
The rule below describes a group of overweight female patients older than 63 years:

High CHD Risk ← sex = female & age > 63 years & body mass index > 25 kg m⁻²

In the second application [4], subgroup-describing rules suggest genes that are characteristic of a given cancer type (e.g., leukemia) in a problem of distinguishing among 14 different cancer types (leukemia, CNS, lung cancer, etc.):

Leukemia ← KIAA0128 is diff. expressed & prostaglandin d2 synthase is not diff. expressed

Semantic Subgroup Discovery: Semantic subgroup discovery refers to subgroup discovery in which semantically annotated knowledge sources (ontologies) are used as background knowledge in the data mining process. Using the technology of relational subgroup discovery (RSD) [14], we have developed an approach to information fusion and semantic data mining that enables background knowledge in the form of ontologies to be used in relational machine learning. The relational subgroup discovery approach, which was successfully adapted and applied to the mining of bioinformatics data [12], generates descriptive rules as conjunctions of ontology terms from the GO, KEGG and ENTREZ ontologies. For instance, an induced description of geneGroup(A), discovered by RSD for the CNS (central nervous system) cancer class in a problem of distinguishing between 14 cancer types, describes a group of genes A differentially expressed in CNS as a conjunction of two relational features:

f_i(A) = interaction(A,B) & process(B,'phosphorylation')

and

f_k(A) = interaction(A,B) & process(B,'negative regulation of apoptosis') & component(B,'intracellular membrane-bound organelle').

The RSD semantic subgroup discovery approach was further refined in the SEGS algorithm (Searching for Enriched Gene Sets) [13], which is used in the information fusion and

31st Annual International Conference of the IEEE EMBS, Minneapolis, Minnesota, USA, September 2-6, 2009. 978-1-4244-3296-7/09/$25.00 ©2009 IEEE