Textmining: Generating association rules from textual data

Ch. C. Latiri and S. BenYahia
Faculty of Sciences of Tunis, Computer Science Department
Campus Universitaire, 1060 Tunis, Tunisia
E-mail: chiraz.latiri@gnet.tn
E-mail: sadok.benyahia@fst.rnu.tn

Abstract

Textmining is an emerging research area whose goal is to discover additional information from hidden patterns in large unstructured textual collections. Given a collection of text documents, most text mining approaches perform knowledge-discovery operations on labels associated with each document; these labels are usually keywords produced by non-trivial keyword-labeling processes. In this paper, we are especially interested in extracting associations from unstructured databases, in particular full text. The aim of this paper is twofold. First, we propose a conceptual approach, based on formal concept analysis [GANT99], to discover knowledge, formally represented by association rules, from a large textual corpus. Second, we introduce an algorithm that derives additional, implicit association rules from the already discovered ones by using an associated taxonomy.

Key words: Textual data, Data mining, Formal concepts, Galois connection, Textmining, Implicit association rule.

Key words (from the French): textual corpus, knowledge discovery in databases (ECD), knowledge discovery in texts (ECT), formal concept, Galois connection, implicit association rules.

1. Introduction

Much of the available information is now in textual form [THUR98]. One problem with textual data is that it is held in unstructured or semi-structured databases. The availability of document collections, and especially of on-line information, is growing rapidly. It is therefore necessary to provide automatic tools for analysing large textual collections. Accordingly, by analogy with datamining for structured data, textmining is defined for textual data [LAND98].
More precisely, we define textmining as the science of extracting additional information from hidden patterns in large unstructured textual collections [SING99]. Its purpose is to extract previously unknown associations from large text databases [FELD95, FELD96b, FELD96a, FELD98a, FELD97]. Textmining shares many characteristics with classical datamining, but differs from it in several ways [LAND98]:

1. many knowledge discovery algorithms defined in the context of datamining are irrelevant or ill-suited to textual applications;
2. special mining tasks, such as concept relationship analysis, are unique to textmining;
3. the unstructured form of full text requires special linguistic pre-processing to extract the main features of the text.

In this paper, we are especially interested in extracting associations from unstructured databases, such as textual databases. The aim of this paper is twofold. First, we propose a conceptual approach, based on formal concept analysis [GANT99], to discover knowledge, formally represented by association rules, from a large textual corpus. Second, we introduce an algorithm that derives additional, implicit association rules from the already discovered ones by using an associated taxonomy.
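To make the notion of an association rule over a textual corpus concrete, the following is a minimal sketch in Python. It assumes the usual support and confidence measures from the datamining literature and a brute-force enumeration of single-term rules; it is not the formal-concept-based approach proposed in this paper. The toy corpus and the names `documents`, `support`, `confidence` and `rules` are hypothetical and chosen only for illustration.

```python
from itertools import permutations

# Hypothetical toy corpus: each document is reduced to its set of index terms.
documents = [
    {"text", "mining", "rule"},
    {"text", "mining"},
    {"text", "concept"},
    {"mining", "rule"},
]

def support(itemset, docs):
    """Fraction of documents that contain every term of `itemset`."""
    itemset = set(itemset)
    return sum(1 for d in docs if itemset <= d) / len(docs)

def confidence(premise, conclusion, docs):
    """Support of (premise plus conclusion) divided by support of premise."""
    premise_support = support(premise, docs)
    if premise_support == 0:
        return 0.0
    return support(set(premise) | set(conclusion), docs) / premise_support

def rules(docs, min_support, min_confidence):
    """Enumerate single-term rules a -> b that meet both thresholds."""
    terms = sorted(set().union(*docs))
    found = []
    for a, b in permutations(terms, 2):
        s = support({a, b}, docs)
        c = confidence({a}, {b}, docs)
        if s >= min_support and c >= min_confidence:
            found.append((a, b, s, c))
    return found

if __name__ == "__main__":
    # Print each retained rule with its support and confidence.
    for a, b, s, c in rules(documents, min_support=0.5, min_confidence=0.9):
        print(f"{a} -> {b}  (support={s:.2f}, confidence={c:.2f})")
```

On this toy corpus, a rule such as "rule -> mining" is read as: documents indexed by the term "rule" also tend to be indexed by the term "mining". The exhaustive enumeration is only practical for very small vocabularies, which is one reason more structured approaches are of interest for large corpora.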