Mining for lexons: applying unsupervised learning methods to create ontology bases

M.-L. Reinberger, P. Spyns, W. Daelemans, R. Meersman

(1) CNTS - University of Antwerp, Universiteitsplein x, B-2000 Antwerpen - Belgium
{reinberg,daelem}@uia.ua.ac.be
(2) STAR Lab - Vrije Universiteit Brussel, Pleinlaan 2 Gebouw G-10, B-1050 Brussel - Belgium
{Peter.Spyns,Robert.Meersman}@vub.ac.be

Abstract. Ontologies, in current computer science parlance, are computer-based resources that represent agreed domain semantics. This paper first introduces the DOGMA ontology engineering approach, which separates "atomic" conceptual relations from "predicative" domain rules. A DOGMA ontology consists of an ontology base that holds sets of intuitive context-specific conceptual relations and a layer of "relatively generic" ontological commitments that hold the domain rules. Secondly, we describe and experimentally evaluate work in progress on a potential method to automatically derive the atomic conceptual relations mentioned above from a corpus of English medical texts. Preliminary outcomes are presented based on the clustering of nouns and compound nouns according to co-occurrence frequencies in the subject-verb-object syntactic context.

Keywords: knowledge representation, machine learning, text mining, ontology, clustering, selectional restriction, co-composition.

1 Introduction and General background

1.1 The Semantic Web

Internet technology has made IT users aware of both new opportunities and actual needs for large-scale interoperation of distributed, heterogeneous, and autonomous information systems. Additionally, the vast amount of information already on-line, or to be interfaced with the WWW, makes it unfeasible to depend merely on human users to correctly and comprehensively identify, access, filter and process the information relevant for the purpose of applications over a given domain.
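The abstract's core idea of grouping nouns by the subject-verb-object contexts they share can be illustrated with a minimal sketch. The toy SVO triples, the (verb, role) context representation, the cosine measure, and the similarity threshold below are all illustrative assumptions, not the paper's actual corpus, parser output, or clustering algorithm.

```python
# Hedged sketch: clustering nouns by co-occurrence frequencies in
# subject-verb-object (SVO) contexts. Standard library only.
from collections import defaultdict
from math import sqrt

# Toy SVO triples (subject, verb, object), as a shallow parser might
# extract them from a medical corpus. Purely illustrative data.
triples = [
    ("surgeon", "performs", "operation"),
    ("surgeon", "performs", "resection"),
    ("physician", "performs", "operation"),
    ("physician", "prescribes", "drug"),
    ("nurse", "administers", "drug"),
    ("nurse", "administers", "injection"),
]

# Represent each noun by its co-occurrence counts over (verb, role) contexts.
vectors = defaultdict(lambda: defaultdict(int))
for subj, verb, obj in triples:
    vectors[subj][(verb, "subj")] += 1
    vectors[obj][(verb, "obj")] += 1

def cosine(u, v):
    """Cosine similarity of two sparse count vectors (dicts)."""
    dot = sum(u[k] * v[k] for k in set(u) & set(v))
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Greedy single-link grouping: a noun joins the first cluster containing
# a sufficiently similar member. Threshold chosen for this toy example.
THRESHOLD = 0.5
clusters = []
for noun in vectors:
    for cluster in clusters:
        if any(cosine(vectors[noun], vectors[m]) >= THRESHOLD for m in cluster):
            cluster.append(noun)
            break
    else:
        clusters.append([noun])

print([sorted(c) for c in clusters])
```

On this toy data the shared contexts pull "surgeon"/"physician" (both subjects of "performs"), "operation"/"resection" (both objects of "performs"), and "drug"/"injection" (both objects of "administers") into common clusters, while "nurse" remains alone; the clusters suggest candidate concepts from which atomic conceptual relations could later be read off.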
Be they called software agents, web services, or otherwise, this is increasingly becoming the task of computer programs equipped with domain knowledge. At present, however, there is an absence of usable formal, standardised and shared domain knowledge of what the information stored inside these systems and exchanged through their interfaces actually means. Nevertheless, this is a prerequisite for agents and services (or even for human users) wishing to access the information but who, obviously, were never involved when these systems were created. The pervasive and explosive proliferation of computerised