J Clin Epidemiol Vol. 50, No. 12, pp. 1327–1338, 1997
Copyright © 1997 Elsevier Science Inc.
0895-4356/97/$17.00
PII S0895-4356(97)00204-7

Design of a Study to Improve Accuracy in Reading Mammograms

Margaret Sullivan Pepe,1,* Nicole Urban,1 Carolyn Rutter,2 and Gary Longton1

1 Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington; and 2 Center for Health Studies, Group Health Cooperative, Seattle, Washington

ABSTRACT. This paper is concerned with the design and analysis of mammography reading studies. In particular we consider studies aimed at evaluating interventions to improve the accuracy with which mammograms are read. A simple randomized design is suggested in which a relatively large group of readers read sets of mammograms before and after an intervention phase. We propose solutions to three difficult statistical issues that arise in the context of such studies: (i) the choice of primary outcome measure; (ii) the data analysis technique to be employed; and (iii) the methodology for calculating sample sizes for readers and images to be read. First, we argue in favor of using sensitivity and specificity as the primary outcome measures rather than receiver operating characteristic (ROC) curves in mammography studies, although the latter are considered state of the art for many types of radiology reading studies. We argue that sensitivity and specificity are more clinically relevant and conceptually more straightforward than ROC curves. Second, we suggest a bivariate approach to data analysis for evaluating intervention effects on sensitivity and specificity. This accommodates the correlations inherent between these measures and allows for estimation of joint effects on them. Finally we propose a method for power calculations that uses computer simulation techniques.
Simple formulas for sample size calculations are not available in part because variability in accuracy amongst readers and variation in difficulty among images introduce complexity into power calculations. The simulation method that we propose accommodates such complexity and is easy to implement. The methodology was motivated by a study funded by the Department of Defense to evaluate the potential efficacy of an educational intervention. In the context of this study we illustrate the steps involved in power calculations and apply the data analytic techniques to the sort of data expected to result from this study. Though the proposed methods were motivated by this particular study, the statistical considerations are relevant more broadly in mammography and indeed in other types of radiologic imaging studies. Standards for the conduct of radiologic reading studies are not yet well developed, as they are for randomized clinical trials and for case-control studies. We hope that the discussion in this paper will add to the dialogue necessary for development of such standards. J Clin Epidemiol 50;12:1327–1338, 1997. © 1997 Elsevier Science Inc.

KEY WORDS. ROC curves, sensitivity and specificity, computer simulation, diagnostic tests, screening

1. INTRODUCTION

Mammography screening for breast cancer has been shown to be associated with decreased breast cancer mortality, at least in women over the age of 50 years [1]. Major efforts are currently underway to improve participation by women in screening programs [2]. Nevertheless, there is concern about the quality of mammography screening and there is general agreement that improvements in quality may lead to improvements in the performance of mammography as a screening modality. Quality might be improved for example by improving the imaging procedures. Alternatively, improvements in the accuracy with which mammographers interpret mammograms may improve the performance of screening mammography. Recent studies [3,4] have shown that there is considerable variability amongst radiologists in their interpretations of screening mammograms. Elmore et al. [3] observed that sensitivities ranged from 74% to 96% and that specificities ranged from 35% to 89% among 10 radiologists reading 150 selected mammograms. Beam et al. [4] using a much larger sample of 108 radiologists, each reading 79 mammograms, found sensitivities in the range of 47–100% and specificities in the range of 35–99%. These observations suggest that improvement in interpretation may be possible.

* Address for correspondence: Margaret Sullivan Pepe, Fred Hutchinson Cancer Research Center, Program in Biostatistics, 1124 Columbia Street, MP-665, Seattle, Washington 98104. Accepted for publication on 20 August 1997.

As part of a project called the Mammography Quality Improvement Project (MQIP) funded by the Department of Defense and aimed at improving the quality of mammog-