Mining for Patterns Based on Contingency Tables by KL-Miner
First Experience

Jan Rauch (1,3), Milan Šimůnek (2,4), Václav Lín (1)

(1) Department of Knowledge and Information Engineering, (2) Department of Information Technologies, (3) EuroMISE centrum - Cardio, (4) LISp
University of Economics Prague, nám. W. Churchilla 4, 130 67 Praha 3, Czech Republic
E-mail: rauch@vse.cz, simunek@vse.cz, xlinv05@vse.cz

Presented at the ICDM 2003 workshop Foundations and New Directions of Data Mining; see http://www.cs.sjsu.edu/faculty/tylin/icdm03_workshop.html

Abstract

A new data mining procedure called KL-Miner is presented. The procedure mines for various patterns based on the evaluation of two-dimensional contingency tables, including patterns of a statistical nature. The procedure is a result of continued development of the academic LISp-Miner system for KDD.

Keywords

Data mining, contingency tables, the LISp-Miner system, statistical patterns

1 Introduction

The goal of this paper is to present first experience with the data mining procedure KL-Miner. The procedure mines for patterns of the form R ∼ C/γ. Here R and C are categorical attributes; the attribute R has categories (possible values) r_1, ..., r_K, and the attribute C has categories c_1, ..., c_L. Further, γ is a Boolean attribute.

The KL-Miner procedure deals with data matrices. We suppose that R and C correspond to columns of the analysed data matrix. We further suppose that the Boolean attribute γ is derived in some way from other columns of the analysed data matrix, and thus that it corresponds to a Boolean column of the analysed data matrix.

The intuitive meaning of the expression R ∼ C/γ is that the attributes R and C are in the relation given by the symbol ∼ when the condition given by the derived Boolean attribute γ is satisfied.

The symbol ∼ is called a KL-quantifier. It corresponds to a condition imposed by the user on the contingency table of R and C.
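To make the reading of R ∼ C/γ concrete, the following is a minimal Python sketch, not the KL-Miner implementation. It builds the contingency table of R and C over the rows that satisfy γ and checks one made-up condition on it (a minimum total count). The function names, the toy data matrix, the attribute "sex", and the threshold are all assumptions introduced for illustration only.

```python
from collections import Counter


def contingency_table(rows, r_attr, c_attr, gamma):
    """Count (R, C) category pairs over the rows for which gamma is true."""
    return Counter((row[r_attr], row[c_attr]) for row in rows if gamma(row))


def holds(table, min_sum):
    """Hypothetical quantifier: the table must cover at least min_sum rows."""
    return sum(table.values()) >= min_sum


# Toy data matrix; attribute names and values are invented for this sketch.
M = [
    {"R": "low", "C": "yes", "sex": "F"},
    {"R": "low", "C": "no", "sex": "F"},
    {"R": "high", "C": "yes", "sex": "M"},
    {"R": "high", "C": "yes", "sex": "F"},
]


def gamma(row):
    """Derived Boolean attribute: true for rows with sex == "F"."""
    return row["sex"] == "F"


table = contingency_table(M, "R", "C", gamma)
print(table[("low", "yes")], holds(table, min_sum=3))
```

Here the table is computed only from the three rows with sex equal to "F", so the fourth row of the full matrix does not contribute. A different quantifier would replace only the `holds` function.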
There are several restrictions that the user can choose to use (e.g. minimal value, sum over the table, value of the χ² statistic, and others).

We call the expression R ∼ C/γ a KL-hypothesis or simply a hypothesis. The KL-hypothesis R ∼ C/γ is true in the data matrix M if the condition corresponding to the KL-quantifier ∼ is satisfied for the contingency table of R and C on the data matrix M/γ. The data matrix M/γ consists of all rows of the data matrix M satisfying the condition γ (i.e. of all rows in which the value of γ is TRUE).

The input of the KL-Miner procedure consists of the analysed data matrix and of several parameters defining a set of potentially interesting hypotheses. Such a set can be very large. The KL-Miner procedure automatically generates all potentially interesting hypotheses and verifies them in the analysed data matrix. The output of the KL-Miner procedure consists of all hypotheses that are true in the analysed data matrix (i.e. supported by the analysed data).

Some details about the input of the KL-Miner procedure are given in section 2. KL-quantifiers are described in section 3. The implementation of the KL-Miner procedure is based on a bit string approach [3, 4]. The principles of the implementation are described in section 4. An example of an application is given in section 5. Some remarks on scalability are in section 5.2.

The KL-Miner procedure is a part of the LISp-Miner system [4, 6] (see http://lispminer.vse.cz). The LISp-Miner system consists of several data mining procedures that can be combined in various ways to enhance the mining power.

Let us remark that KL-Miner is a GUHA procedure in the sense of the book [1]. Therefore, we shall use the terminology introduced in [1]. The potentially