Issues in Assessing Multi-Institutional Performance of BI-RADS-based CAD Systems

Mia K. Markey* (a), Joseph Y. Lo (b)

(a) Biomedical Informatics Lab, Dept. of Biomedical Engineering, The University of Texas at Austin
(b) Duke Advanced Imaging Labs, Dept. of Radiology, Duke University Medical Center

ABSTRACT

The purpose of this study was to investigate factors that impact the generalization of breast cancer computer-aided diagnosis (CAD) systems that utilize the Breast Imaging Reporting and Data System (BI-RADS). Data sets from four institutions were analyzed: Duke University Medical Center, University of Pennsylvania Medical Center, Massachusetts General Hospital, and Wake Forest University. The latter two data sets are subsets of the Digital Database for Screening Mammography. Each data set consisted of descriptions of mammographic lesions according to the BI-RADS lexicon, patient age, and pathology status (benign/malignant). Models were developed to predict pathology status from the BI-RADS descriptors and the patient age. Models built on data from the different institutions were compared in terms of empirical (non-parametric) receiver operating characteristic (ROC) curves. Results suggest that BI-RADS-based CAD systems focused on specific classes of lesions may be more generally applicable than models that cover several lesion types. However, better generalization was seen in terms of the area under the ROC curve than in terms of the partial area index (>90% sensitivity). Previous studies have illustrated the challenges in translating a BI-RADS-based CAD system from one institution to another. This study provides new insights into possible approaches to improve the generalization of BI-RADS-based CAD systems.

Keywords: Diagnosis, Computer-Assisted; Mammography; Breast Neoplasms; ROC Curve; Sensitivity and Specificity; Medical Informatics; Pattern Recognition; Calibration

1. INTRODUCTION

Breast cancer is the most common cancer and the second leading cause of cancer deaths among American women [1]. Fortunately, early detection has been shown to increase treatment options and the survival rate [2]. Mammography, x-ray imaging of the breast, is the primary modality for both screening and diagnostic exams of the breast. Improvements are needed in both the sensitivity and the specificity of mammography. To this end, computer-aided detection (CAD) and computer-aided diagnosis (CADx) methods have been investigated [3-6].

Several previous studies have explored the use of statistical and machine learning methods for predicting the pathology status (benign, malignant) of breast lesions from descriptions produced by mammographers [7-23]. Approaches based on the American College of Radiology Breast Imaging Reporting and Data System (BI-RADS) [24] are particularly appealing since this lexicon is widely used. A prior study raised concerns about whether such predictive models developed on data from one institution are applicable to data obtained at another institution [15]. The purpose of this study was to investigate factors that impact the generalization of breast cancer CADx systems based on BI-RADS descriptors.

2. METHODS

2.1 Data

Data sets from four institutions were analyzed in this study: Duke University Medical Center (“Duke”), University of Pennsylvania Medical Center (“Penn”), Massachusetts General Hospital (“MGH”), and Wake Forest University (“WFU”). The Duke and Penn data sets have been previously described [15]. The MGH and WFU data sets were extracted from the Digital Database for Screening Mammography [25, 26] (DDSM, http://marathon.csee.usf.edu/Mammography/Database.html). Details on the extraction process and the resulting MGH and WFU sets are provided on the author’s website (http://www.bme.utexase.edu/research/informatics).

Medical Imaging 2005: Image Processing, edited by J. Michael Fitzpatrick, Joseph M. Reinhardt, Proc. of SPIE Vol. 5747 (SPIE, Bellingham, WA, 2005) 1605-7422/05/$15 · doi: 10.1117/12.594706 858