Testing the Significance of Attribute Interactions

Aleks Jakulin jakulin@acm.org
Ivan Bratko ivan.bratko@fri.uni-lj.si
Faculty of Computer and Information Science, Tržaška cesta 25, SI-1001 Ljubljana, Slovenia

Abstract

Attribute interactions are the irreducible dependencies between attributes. Interactions underlie feature relevance and selection, and the structure of joint probability and classification models: if and only if the attributes interact, they should be connected. While the issue of 2-way interactions, especially of those between an attribute and the label, has already been addressed, we introduce an operational definition of a generalized n-way interaction by highlighting two models: the reductionistic part-to-whole approximation, where the model of the whole is reconstructed from models of the parts, and the holistic reference model, where the whole is modelled directly. An interaction is deemed significant if these two models are significantly different. In this paper, we propose the Kirkwood superposition approximation for constructing part-to-whole approximations. To model data, we do not assume a particular structure of interactions, but instead construct the model by testing for the presence of interactions. The resulting map of significant interactions is a graphical model learned from the data. We confirm that the P-values computed with the assumption of the asymptotic $\chi^2$ distribution closely match those obtained with the bootstrap.

Appearing in Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004. Copyright 2004 by the authors.

1. Introduction

1.1. Information Shared by Attributes

We will address the problem of how much one attribute tells about another, that is, how much information is shared between attributes. This general problem comprises both attribute relevance and attribute interactions. Before that, we need to define a few terms.
Formally, an attribute $A$ will be considered to be a collection of independent, but mutually exclusive attribute values $\{a_1, a_2, a_3, \ldots, a_n\}$. We will write $a$ as an example of a value of $A$. An instance corresponds to an event that is the conjunction of attributes' values. For example, an instance is "Playing tennis in hot weather." Such instances are described with two attributes: $A$, with the range $\Re_A = \{\mathrm{play}, \neg\mathrm{play}\}$, and the attribute $B$, with the range $\Re_B = \{\mathrm{cold}, \mathrm{warm}, \mathrm{hot}\}$. If our task is deciding whether to play or not to play, the attribute $A$ has the role of the label.

An attribute is relevant to predicting the label if it has something in common with it. To be able to estimate this commonness, we need a general model that connects the attribute and the label, and that functions with uncertain and noisy data. In general, models with uncertainty can be stated in terms of joint probability density functions. A joint probability density function (joint PDF) maps each possible combination of attribute values into the probability of its occurrence. The joint PDF $p$ for this example is a map $p : \Re_A \times \Re_B \to [0, 1]$. From the joint PDF, we can always obtain a marginal PDF by removing or marginalizing one or more attributes. The removal is performed by summing probabilities over all the combinations of values of the removed attributes. For example, the PDF of attribute $A$ would hence be $p(a) = \sum_b p(a, b)$.

One way of measuring uncertainty given a joint PDF $p$ is with Shannon's entropy $H$, defined for a joint PDF of a set of attributes $V$:

$$H(V) \triangleq -\sum_{\vec{v} \in \Re_V} p(\vec{v}) \log_2 p(\vec{v}) \quad (1)$$

If $V = \{A, B\}$, then $\vec{v}$ would have the range $\Re_V = \Re_A \times \Re_B$, the Cartesian product of the ranges of the individual attributes. If the uncertainty given the joint PDF is $H(AB)$, and the uncertainties given the two marginal PDFs are $H(A)$ and $H(B)$, the shared uncertainty, the mutual information or information gain