Proceedings of the 28th Annual Hawaii International Conference on System Sciences - 1995

Knowledge discovery in a genetic database: The MINOS system.

H. Ripoche, J. Sallantin
LIRMM, UMR 9928 CNRS - Montpellier II, 161 rue Ada, F-34392 Montpellier.
E-mail: { hrjs}@lirmm.fr

Abstract

This paper concerns the management of genetic sequences in an object-oriented database and the extraction of knowledge from these sequences. In our case, knowledge discovery consists in finding functions capable of predicting properties of genetic sequences. This problem is also known as functional inference. The paper is divided into two parts: the first shows the interest of using an object-oriented query language to build and use prediction functions. In the second part, we propose to use prediction functions as descriptors of sequences in order to index them. The indexing is performed with concept lattices [17].

Keywords: Machine Learning/Discovery in Large Databases, Interactive Data Exploration and Discovery, Re-use of Discovered Knowledge, Object-Oriented Databases, Concept Lattices, Genetic Sequences.

1 Introduction: Discovery through data exploration and data criticism

We think that scientific discovery requires methods that progressively gather and analyse heterogeneous information through interaction with human experts. As a consequence, we need a knowledge management system capable of guiding the expert's exploration by emphasizing the inner relationships between data. The knowledge management system should thus facilitate data comparison and criticism.

In this paper, we define an environment that supports a gradual and complete exploration of a sequence database. This kind of operation is also called Database Mining. In our system, genetic sequences are represented by objects of an Object-Oriented Database Management System (OODBMS). This is the reason why we call it MINOS (MINing Object System).
The interaction with the user is accomplished through the query language of the underlying database system. This query language makes it possible to build functions that predict properties of genetic sequences, and to use these prediction functions to detect properties in sequences. If the properties of the sequences are known from a direct biological experiment, applying prediction functions to the sequences is a way to criticize or validate these functions. This is a first kind of knowledge revision.

When prediction functions are available, they can be used to describe sequences through a concept lattice. The interest of using a concept lattice for sequence analysis is twofold. Firstly, it helps navigation in a database of sequences by visualizing the sequences and their associated properties. The lattice can be displayed graphically, which provides hypertext-like functionalities [10]. Secondly, it is a tool for knowledge revision, because the nodes of the lattice group examples sharing a common set of properties, and these relationships can be criticized.

Figure 1 shows how the knowledge acquisition process works. An initial selection of data is analysed by a learning algorithm. The result of this algorithm (typically classes grouping the initial data) is given to a human analyst. The analyst compares the result produced via machine learning with previous results and with his own knowledge of the problem. He then suggests criticisms of the result that help in choosing a new set of data to be examined.

Figure 1: An interactive learning cycle (Initial selection, Learning => Result).

1060-3425/95 $10.00 © 1995 IEEE
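To make the two uses of prediction functions concrete, the following is a minimal sketch in Python, not the MINOS implementation. It assumes that a prediction function can be modelled as a Boolean function on a sequence string. The three predictors, the sequence names and the "known" labels are hypothetical and chosen only for illustration. The sketch first lists the sequences on which a prediction function disagrees with an experimentally known property, then enumerates the nodes of the concept lattice obtained when sequences are described by the properties predicted for them.

```python
# Hypothetical prediction functions: each maps a sequence to True/False.
# They are illustrative assumptions, not the functions used in MINOS.
def has_tata_box(seq):
    return "TATAAA" in seq

def gc_rich(seq):
    if not seq:
        return False
    return (seq.count("G") + seq.count("C")) / len(seq) > 0.5

def starts_with_atg(seq):
    return seq.startswith("ATG")

PREDICTORS = {
    "has_tata_box": has_tata_box,
    "gc_rich": gc_rich,
    "starts_with_atg": starts_with_atg,
}

# Hypothetical toy sequences.
SEQUENCES = {
    "s1": "ATGTATAAAGC",
    "s2": "ATGGCGCGCC",
    "s3": "GCGCTATAAAGCGCGC",
}

def disagreements(sequences, predictor, known):
    """Names of sequences whose predicted property differs from the known one."""
    return [name for name, seq in sequences.items()
            if name in known and predictor(seq) != known[name]]

def build_context(sequences, predictors):
    """Describe each sequence by the set of properties predicted for it."""
    return {name: frozenset(p for p, f in predictors.items() if f(seq))
            for name, seq in sequences.items()}

def concepts(context, all_props):
    """Enumerate the formal concepts (extent, intent) of the context.

    Every intent is an intersection of sequence descriptions, or the full
    property set (the intent shared by an empty group of sequences).
    """
    intents = {frozenset(all_props)}
    for props in context.values():
        intents |= {props & intent for intent in intents}
    result = []
    for intent in intents:
        extent = frozenset(name for name, props in context.items()
                           if intent <= props)
        result.append((extent, intent))
    # Largest groups first, i.e. from the top of the lattice downwards.
    return sorted(result, key=lambda c: (-len(c[0]), sorted(c[1])))

# Hypothetical labels standing in for results of a biological experiment.
KNOWN_TATA = {"s1": True, "s2": True, "s3": True}
to_review = disagreements(SEQUENCES, has_tata_box, KNOWN_TATA)

context = build_context(SEQUENCES, PREDICTORS)
lattice = concepts(context, PREDICTORS.keys())

if __name__ == "__main__":
    print("sequences to review:", to_review)
    for extent, intent in lattice:
        print(sorted(extent), "share", sorted(intent))
```

Each printed pair corresponds to one node of the lattice: a group of sequences together with the set of predicted properties they all share. An analyst can inspect such a node and criticize either the grouping or the prediction functions that produced it. The enumeration used here is a naive closure under intersection, adequate for a handful of properties; it is not meant to reflect the lattice construction algorithm used in the paper.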