Mining Constrained Association Rules to Predict Heart Disease

Carlos Ordonez 1, Edward Omiecinski 1, Levien de Braal 1, Cesar A. Santana 2, Norberto Ezquerra 1, Jose A. Taboada 3, David Cooke 2, Elizabeth Krawczynska 2, Ernest V. Garcia 2

1 Georgia Institute of Technology
2 Emory University Hospital
3 Universidad de Santiago Compostela

Abstract: This work describes our experience in discovering association rules in medical data to predict heart disease. We focus on two aspects: mapping medical data to a transaction format suitable for mining association rules, and identifying useful constraints. Based on these aspects we introduce an improved algorithm to discover constrained association rules. We present an experimental section explaining several interesting discovered rules.

I. INTRODUCTION

Data mining is an active research area. One of the most popular approaches to data mining is discovering association rules [1], [2]. Association rules are generally used with basket, census, or financial data. Medical data, on the other hand, is generally analyzed with classifier trees, clustering, or regression, but rarely with association rules. A survey of these techniques is found in [10].

In this work we analyze the idea of discovering constrained association rules in medical records that include numeric, categorical, time, and image data. This work is based on a long-term joint research effort by Georgia Tech and Emory University to discover knowledge in medical data to predict coronary heart disease [7], [6], [5], [13], [14]. In [6] association rules are proposed and preliminary results are justified from the medical point of view. In [5] neural networks are used to predict reversibility images based on stress and myocardial thickening images. In [14] we explore the idea of constraining association rules in binary data and report preliminary findings from a data mining perspective.
One of the most important features of association rules is that they are combinatorial in nature. This is particularly useful to discover patterns that appear in subsets of all the attributes. However, most patterns discovered by algorithms that do not constrain associations are not useful because they may contain redundant information, may be irrelevant, or may describe trivial knowledge. The goal is then to find those rules that are medically significant or interesting, but which also have minimum support and confidence.

Copyright 2001 IEEE. Published in International Conference on Data Mining (ICDM), p. 433-440, 2001. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. http://doi.ieeecomputersociety.org/10.1109/ICDM.2001.989549

In our research project the discovered rules have two main purposes: validating rules used by an expert system to aid in diagnosing coronary heart disease (PERFEX [9], [7]) and discovering new rules that relate patient data to heart disease and thus can enrich the expert system knowledge base. At the moment all rules used by our expert system were discovered and validated by a group of domain experts, as described in detail in [9]. Since PERFEX is essentially a production rule system (i.e., composed of IF-THEN rules) used in conjunction with temporal and uncertainty reasoning models, the discovery of knowledge resulting from association rule mining would represent a potentially powerful and innovative way to validate and acquire knowledge to enhance the knowledge base.
Importantly, the methods proposed herein are capable of inferring medical knowledge from a vast array of data that includes image and alphanumeric data representing highly relevant, patient-specific clinical data (such as electrocardiographic information, patient history, symptoms, and the results of clinical tests). Hence the methods described in this paper may provide a more efficient knowledge acquisition technique than classical approaches.

Throughout the paper we try to provide a general framework for understanding the approach underlying our research. We believe many of the problems we are facing (small data size, richness of content, high dimensionality, missing information, etc.) are likely to appear in other domains. As such, this work tries to isolate those problems that we consider to be of greatest interest to the data mining community.

A. Contributions and paper outline

Our main contributions are the following. First, a justification is given for the use of association rules in the medical domain. We explain why mining medical data for association rules is an interesting and hard problem, and we present the problem in an abstract manner so that this work can be applied to other domains. Second, we introduce a simple mapping algorithm that transforms medical records into a binary format suitable for mining constrained association rules. Third, we identify important constraints that make association rules useful for the medical domain and propose an algorithm to discover constrained association rules with very low support and relatively high confidence. Finally, we identify open problems that require further research.
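To make the notions of a binary transaction format, support, and confidence concrete, the following is a minimal illustrative sketch and not the mapping algorithm proposed in this paper. The attribute names, the numeric cutoffs in `CUTOFFS`, the item naming scheme, and the four toy records are all assumptions invented for illustration; they are not taken from our data set.

```python
from typing import Dict, FrozenSet, List, Union

Value = Union[float, str, None]

# Hypothetical cutoffs used to split numeric attributes into intervals.
CUTOFFS: Dict[str, List[float]] = {"age": [40, 60], "chol": [200]}


def record_to_items(record: Dict[str, Value]) -> FrozenSet[str]:
    """Map one record to a set of binary items; missing values yield no item."""
    items = set()
    for attr, value in record.items():
        if value is None:
            continue  # missing information: no item is generated
        if isinstance(value, str):
            items.add(f"{attr}={value}")  # one item per categorical value
        else:
            # Index of the interval the numeric value falls into.
            bin_idx = sum(value >= cut for cut in CUTOFFS[attr])
            items.add(f"{attr}:bin{bin_idx}")
    return frozenset(items)


def support(itemset: FrozenSet[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent: FrozenSet[str], consequent: FrozenSet[str],
               transactions: List[FrozenSet[str]]) -> float:
    """support(antecedent and consequent) / support(antecedent)."""
    s_a = support(antecedent, transactions)
    if s_a == 0:
        return 0.0
    return support(antecedent | consequent, transactions) / s_a


# Toy records, made up for illustration only.
records = [
    {"age": 65, "chol": 240, "sex": "M", "disease": "yes"},
    {"age": 45, "chol": 180, "sex": "F", "disease": "no"},
    {"age": 70, "chol": None, "sex": "M", "disease": "yes"},
    {"age": 62, "chol": 250, "sex": "F", "disease": "no"},
]
transactions = [record_to_items(r) for r in records]

rule_support = support(frozenset({"age:bin2", "disease=yes"}), transactions)
rule_confidence = confidence(frozenset({"age:bin2"}),
                             frozenset({"disease=yes"}), transactions)
```

In this sketch a rule such as "age:bin2 implies disease=yes" is kept only if `rule_support` and `rule_confidence` meet user-chosen minimum thresholds. Representing each transaction as a set of items lets a missing value be handled by simply omitting the corresponding item.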