Case-Base Reduction for a Computer Assisted Breast Cancer Detection System Using Genetic Algorithms

Maciej A. Mazurowski, Student Member, IEEE, Piotr A. Habas, Student Member, IEEE, Jacek M. Zurada, Fellow, IEEE, Georgia D. Tourassi, Member, IEEE

Abstract: A knowledge-based computer assisted decision (KB-CAD) system is a case-based reasoning system previously proposed for breast cancer detection. Although it was demonstrated to be very effective for the diagnostic problem, it was also shown to be computationally expensive because it uses mutual information between images as a similarity measure. Here, the authors propose to alleviate this drawback by reducing the case-base size. The problem is formalized and a genetic algorithm is utilized as an optimization tool. A representation and operators appropriate for the problem are presented and discussed. A clinically relevant index, the area under the receiver operator characteristic curve, is used as a measure of the system performance during the optimization and testing stages. Experimental results show that the proposed method can significantly reduce the case-base size while the classification performance of the KB-CAD, in fact, increases.

I. INTRODUCTION

Computer assisted decision (CAD) systems have been demonstrated to be very efficient [1], [2], particularly in breast cancer detection and diagnosis [3], [4]. Such systems assist a radiologist in analyzing mammograms (x-ray images of a breast) and in making a decision regarding the treatment of a patient.

There are two general phases in the process of constructing and running a CAD system. The first phase is image processing. In this phase, some preprocessing can be applied, the suspicious regions on the mammograms can be found, and features of the analyzed image can be extracted. In the second phase, a decision regarding the query mammogram is made based on the information provided by the image processing steps.
In this study, the authors focus only on the second phase.

Maciej A. Mazurowski is with the Department of Electrical and Computer Engineering, University of Louisville, Louisville, KY 40292, USA (phone: 1 502 852 3165; e-mail: m.mazurowski@ieee.org).
Piotr A. Habas is with the Department of Electrical and Computer Engineering, University of Louisville, Louisville, KY 40292, USA (phone: 1 502 852 3165; e-mail: piotr.habas@louisville.edu).
Georgia D. Tourassi is with the Department of Radiology, Duke University Medical Center, Durham, NC 27705 and with the Department of Electrical and Computer Engineering, University of Louisville, Louisville, KY 40292, USA (phone: 1 919 684 1447; e-mail: georgia.tourassi@duke.edu).
Jacek M. Zurada is with the Department of Electrical and Computer Engineering, University of Louisville, Louisville, KY 40292, USA (phone: 1 502 852 6314; e-mail: jacek.zurada@louisville.edu).

There are two main types of CAD systems depending on the decision algorithm: rule-based systems and case-based systems. Rule-based systems attempt to extract rules from the training samples and use these rules to classify future cases. Examples of such systems are artificial neural networks [5] and Bayesian classifiers [6]. The other type comprises systems that implement case-based reasoning [7]. In such systems, examples are stored in the database of the system. When a new query case arrives, it is compared to some or all of the stored examples, and the decision is based on the results of these comparisons. In [8], Chang et al. present a case-based system where the comparison between mammographic images is performed using a set of features extracted from the images.

In [9], Tourassi et al. propose the knowledge-based computer assisted decision (KB-CAD) system, where the comparison between images is made using an information-theoretic approach, namely mutual information.
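To illustrate why image-to-image comparison by mutual information is costly, the following is a minimal sketch of a histogram-based estimate and of comparing one query against every stored case. It is not the KB-CAD implementation: the flat-list image representation, the intensity binning, the labels, and the helper names `mutual_information` and `rank_cases` are all assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(image_a, image_b, bins=4, max_value=256):
    """Estimate mutual information (in bits) between two equally sized images.

    Each image is a flat sequence of integer intensities in [0, max_value).
    The binning scheme is an illustrative assumption, not taken from the paper.
    """
    if len(image_a) != len(image_b) or not image_a:
        raise ValueError("images must be non-empty and of equal size")

    # Quantize intensities into a small number of bins.
    qa = [v * bins // max_value for v in image_a]
    qb = [v * bins // max_value for v in image_b]

    n = len(qa)
    joint = Counter(zip(qa, qb))  # joint histogram of co-located pixels
    pa = Counter(qa)              # marginal histogram of image A
    pb = Counter(qb)              # marginal histogram of image B

    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y), written with raw counts
        mi += p_xy * math.log2(p_xy * n * n / (pa[x] * pb[y]))
    return mi


def rank_cases(query, case_base):
    """Compare the query with every stored (image, label) pair.

    Returns (similarity, label) pairs sorted from most to least similar.
    One mutual information computation is needed per stored case.
    """
    scored = [(mutual_information(query, image), label)
              for image, label in case_base]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    query = [0, 64, 128, 192]
    case_base = [([0, 0, 0, 0], "normal"), ([0, 64, 128, 192], "mass")]
    print(rank_cases(query, case_base))
```

Because `rank_cases` evaluates the similarity once per stored image, the response time for a query grows linearly with the number of cases in the database.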
The main advantage of this approach is that no feature extraction is necessary. On the other hand, calculating the mutual information between two images is computationally expensive. This in turn means a slow response of the system when a new query is presented, because the mutual information between the query image and each of the images in the database has to be calculated. This problem can be alleviated by reducing the image database of the KB-CAD, that is, by removing the irrelevant images.

Three parameters of the case-based system can be optimized in this way. The first parameter is the time the system needs to classify a query. In the KB-CAD, when a query image is presented to the system, the mutual information between the query image and all the images in the database has to be calculated, which is very time consuming. Removing unnecessary images from the database decreases this time. The second parameter is the storage requirement. Removing images from the database decreases the space required to store it. The last parameter that can be optimized by reducing the case base is the classification accuracy. Such an improvement may be obtained by removing misleading examples from the database.

1-4244-1340-0/07/$25.00 © 2007 IEEE

Multiple algorithms have been proposed for case-base reduction (for an overview, see [10]). Here, the authors propose to use an algorithm that evaluates subsets of the entire sample dataset using a clinically relevant criterion, namely the receiver operator characteristic (ROC) of the KB-CAD system, in order to obtain the best possible performance for a given sample dataset size. In order to perform the optimization, a genetic algorithm [11], [12] is used. Evolutionary algorithms have been previously applied to the case-based selection problem [13], [14]. To the knowledge of the