Abstracts / Journal of Clinical Epidemiology 55 (2002) 627–632

speed, cognitive function and oral steroid use (HR 0.58, 95% CI 0.35–0.97).
Conclusion: Screening for osteoporosis was associated with 40% fewer incident hip fractures over 6 years, compared to usual primary care. Although this study was not randomized, these data suggest that screening for osteoporosis may be beneficial for community-dwelling women and men over age 65.

LIMITATIONS OF THE STATISTICS USED TO MEASURE RESPONSIVENESS IN TWO LONGITUDINAL COHORT STUDIES OF COGNITIVE FUNCTION AND THE UTILITY OF APPROACHES DERIVED FROM ITEM RESPONSE THEORY
Crane PK, van Belle G
University of Washington, Seattle, Washington

Purpose: We are interested in measuring changes in cognitive function in longitudinal cohort studies. Specifically, we were interested in the responsiveness of the Mini-Mental Status Exam (MMSE), that is, its ability to detect changes over time. We were unable to find any previously published reports on the responsiveness of the MMSE. Responsiveness has been hailed as the principal criterion for determining the utility of an instrument for assessing change in the context of evaluative studies such as clinical trials or longitudinal cohort studies. Statistics previously advocated to measure responsiveness have implicitly assumed a constant variance. The variance structure can be determined empirically, and the assumption of constant variance can be tested. However, such determinations can be carried out only after the fact, once the choice of scale has already been made and the data are available to analyze.
Methods: Item response theory (IRT) provided tools that proved useful in analyzing these data with respect to responsiveness. Specifically, IRT enabled direct visualization of the variance structure across the entire spectrum of cognitive function as measured by the MMSE.
Results: In our studies, we found that the assumption of constant variance did not hold for the MMSE, precluding our use of currently available responsiveness statistics.
Conclusions: Currently available statistics for assessing responsiveness have significant limitations due to their assumption of constant variance. IRT allows direct visualization of the expected variance structure across the entire spectrum of the trait measured by the instrument. In our case, this visualization led us to discard existing techniques for measuring the responsiveness of the MMSE. IRT should be used as a first step in assessing the responsiveness of instruments, in order to determine whether the assumption of constant variance is appropriate.

WHAT IS “BIOCHEMICAL FAILURE” IN PROSTATE CANCER?
Lagu TC, Wells CK, Penson D, Concato J
Yale University School of Medicine, New Haven, CT; and VA Medical Centers, Seattle, WA and West Haven, CT

Increases in prostate-specific antigen (PSA) are often used as a surrogate endpoint (e.g., for mortality) when evaluating potential benefits of therapy for cancer of the prostate (CaP). In this context, different definitions of “failure” have evolved for each type of treatment (e.g., surgery, radiation), leading to potential confusion when physicians and patients evaluate data on outcomes. Our purpose was to compare existing definitions of PSA failure and to evaluate a new one.
Patients were from an ongoing study of 1320 men diagnosed with CaP in the New England Veterans Affairs Healthcare System during 1991–1995. To ensure an adequate spectrum of CaP, a random sample of 125 men with CaP who had died was first identified, and 125 age-matched men with CaP who were still alive were selected from the same VA sites. Data regarding baseline characteristics and post-treatment PSA (4–9 year follow-up) were collected.
Proportions of men with PSA failure and median time to PSA failure were determined, regardless of actual therapy received, based on the criteria a) after prostatectomy (any detectable PSA); b) after radiotherapy, per the American Society for Therapeutic Radiology and Oncology (ASTRO) (three consecutive rises after a nadir PSA); and c) applying a threshold for PSA slope (1 ng/mL/year).
Among 250 men, 12 (4.8%) were excluded (e.g., unavailable records). For the remaining 238 patients, surgical failure was evident in 165 (69.3%), whereas the ASTRO criteria classified 86 (36.1%) as having failed; the kappa statistic for concordance was 0.28 (95% confidence interval 0.17–0.38), indicating “poor” agreement. The median time to PSA failure also varied widely: 4.5 months for surgery vs. 37.6 months per ASTRO. Failure based on PSA slope identified 112 men (47.1%); the median time to failure (22.7 months) was also intermediate between the existing approaches.
Currently used definitions of treatment failure in CaP can produce substantially different results when assessing the same clinical phenomena. A new definition of PSA failure, based on the slope of PSA, is easy to use and applicable regardless of type of treatment. This research can improve the ability of providers (and their patients) to make informed and appropriate decisions regarding treatment for prostate cancer.

IMPROVING THE SLOPE INDEX AS A DISCRIMINATION MEASUREMENT
Karafa MT, Dawson NV
Cleveland Clinic Foundation, Cleveland, OH; Center for Health Care Research and Policy, Case Western Reserve University at MetroHealth Medical Center, Cleveland, OH

Background & Purpose: It has been demonstrated that analysis of ROC area can be somewhat insensitive to observed differences in discrimination compared with the Normalized Discrimination Index (NDI) and the Slope Index (SI) (MDM 1999, p524).
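The three PSA-failure criteria from the preceding abstract (any detectable post-prostatectomy PSA; ASTRO's three consecutive rises after the nadir; a least-squares PSA slope of at least 1 ng/mL/year) can be sketched as simple classifiers. The detection limit of 0.2 ng/mL below is an assumed illustrative value, not one stated in the abstract, and the criteria are paraphrased rather than taken from the ASTRO consensus text:

```python
def surgical_failure(psa, detectable=0.2):
    """Post-prostatectomy: any detectable PSA (detection limit assumed)."""
    return any(v >= detectable for v in psa)

def astro_failure(psa):
    """ASTRO-style: three consecutive rises after the nadir PSA."""
    if len(psa) < 4:
        return False
    post = psa[psa.index(min(psa)):]  # values from the nadir onward
    rises = 0
    for prev, cur in zip(post, post[1:]):
        rises = rises + 1 if cur > prev else 0
        if rises >= 3:
            return True
    return False

def slope_failure(times_yr, psa, threshold=1.0):
    """Least-squares slope of PSA over time; failure if >= 1 ng/mL/year."""
    n = len(psa)
    mt, mp = sum(times_yr) / n, sum(psa) / n
    num = sum((t - mt) * (p - mp) for t, p in zip(times_yr, psa))
    den = sum((t - mt) ** 2 for t in times_yr)
    return (num / den) >= threshold
```

Applying such definitions to the same PSA series makes it easy to see how one patient can "fail" under one criterion but not another, which is the discordance the kappa of 0.28 quantifies.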
SI has the most direct relationship to the true probability separation in predictions, but it is unaffected by the variance of predictions. We examine three modifications to the SI and compare them with ROC area (in terms of Somers' D) and the NDI.
Methods: To determine the minimum and maximum detectable separation in mean predictions, we examined three measures of discrimination, each on a 0.0 to 1.0 scale: the NDI, Somers' D (a transformation of ROC area), and the SI. In addition, to correct the SI for changes in the variance, we used adjustments based on the misclassification rate (MSI), the “noise” parameter (NSI), and the pooled variance in predictions (VSI). We generated random datasets consisting of 50 events and 50 non-events. Each observation was assigned a simulated predicted probability of the event using small-, moderate-, and large-variance distributions of known separation. To examine separation, the distributions were gradually changed from a mean prediction of 0.5 for each group to 1.0 for events and 0.0 for non-events. For changes in variance, we used a fixed amount of separation and a very tight distribution of predictions,
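The simulation design above can be sketched as follows. Somers' D is the standard 2·AUC − 1 transformation of ROC area; the Slope Index is taken here as the difference in mean predicted probability between events and non-events, which is a common reading of the term rather than the authors' exact formula, and the mean/SD values are illustrative stand-ins for the abstract's "known separation" distributions:

```python
import random

def slope_index(p_events, p_nonevents):
    """Separation of mean predicted probabilities (assumed SI definition)."""
    return sum(p_events) / len(p_events) - sum(p_nonevents) / len(p_nonevents)

def somers_d(p_events, p_nonevents):
    """Somers' D = 2*AUC - 1, via pairwise concordance counting."""
    conc = ties = 0
    for pe in p_events:
        for pn in p_nonevents:
            if pe > pn:
                conc += 1
            elif pe == pn:
                ties += 1
    n = len(p_events) * len(p_nonevents)
    auc = (conc + 0.5 * ties) / n
    return 2.0 * auc - 1.0

random.seed(0)
clamp = lambda x: max(0.0, min(1.0, x))
# 50 events and 50 non-events, as in the abstract; means/SDs are illustrative.
events = [clamp(random.gauss(0.7, 0.1)) for _ in range(50)]
nonevents = [clamp(random.gauss(0.3, 0.1)) for _ in range(50)]
print(round(slope_index(events, nonevents), 3),
      round(somers_d(events, nonevents), 3))
```

Holding the mean separation fixed while tightening or widening the prediction distributions changes Somers' D but leaves the SI essentially unchanged, which illustrates why variance-based corrections such as the MSI, NSI, and VSI are of interest.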