Development of a Flexilevel Scale for use with computer-adaptive testing for assessing shoulder function

Karon F. Cook, PhD,a,b,c Toni S. Roddey, PT, PhD, OCS, FAAOMPT,d Kimberly J. O’Malley, PhD,a,e,f and Gary M. Gartsman, MD,g Houston and Austin, TX

In a 5-year study, a self-report measure of shoulder function, the Flexilevel Scale of Shoulder Function (FLEX-SF), was developed by use of item response theory. A large pool of candidate items (N = 68) was developed. A questionnaire that included the 68 items, another scale of shoulder function, and clinical and demographic questions was administered to 400 persons with shoulder complaints. Patients’ responses to the 68 items were calibrated by use of Andrich’s rating scale model. Thirty-three items were selected from the pool and subdivided into three overlapping testlets targeting low, medium, and high shoulder function. A table translates raw scores on testlets to a common mathematical metric. The validity and reliability of the FLEX-SF were evaluated in a longitudinal study of 199 patients. The FLEX-SF scores were highly reliable and exhibited excellent validity (including responsiveness). We report on a simulation of a computer-adaptive test of shoulder function. This simulation is based on the developmental items we tested for use in the FLEX-SF. The results indicate that greater measurement efficiency can be achieved with a computer-adaptive test format. (J Shoulder Elbow Surg 2005;14:90S-94S.)

In recent years, interest in self-reported outcomes has increased substantially. The use of patients’ subjective judgments to evaluate health outcomes implies confidence that the measures that elicit these self-reports are scientifically sound. Psychometrics is the science and mathematics that concerns itself with such issues. The development of an outcome measure by use of psychometric methods is a rigorous, expensive, and time-consuming project.
The investment returns confidence that the scale’s scores adequately and accurately portray the outcome of interest. Scientifically sound measurement is fundamental to excellence in research and clinical evaluations.

Our work in evaluating the psychometric properties of existing scales of self-reported shoulder outcome [4,6,20] convinced us of the need to develop a new measure. This report details how, in a 5-year study, psychometric methods were used to develop a self-report measure of shoulder function, the Flexilevel Scale of Shoulder Function (FLEX-SF) [5]. We developed this scale by use of item response theory (IRT) [8], a psychometric method that (1) accounts for differences in item difficulty and (2) supports Flexilevel Scales. A Flexilevel Scale is composed of 2 or more “testlets,” or subsets of items, that target respondents with different levels of the trait being measured. The FLEX-SF measures self-reported shoulder function. Patients respond to an initial “routing item” that classifies them as having low, medium, or high shoulder function. They then respond only to the testlet that best targets their level of shoulder function.

In addition to describing the development of the FLEX-SF, we report on a simulation of a computer-adaptive test (CAT) of shoulder function. This simulation is based on the developmental items we tested for use in the FLEX-SF. CAT-based outcome measures are even more efficient than Flexilevel Scales. They hold substantial promise in the field of outcomes research.

ITEM POOL DEVELOPMENT

A first step in the psychometric method is to develop a large pool of items that could potentially be included in the final measure. With a large pool of initial items, scale developers can be selective in choosing the best items for the measure.
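The IRT calibration mentioned in the introduction can be made concrete with a short sketch. The code below computes response-category probabilities under Andrich's rating scale model, in which the probability of responding in category k depends on the person measure (theta), the item difficulty (delta), and a set of thresholds (taus) shared by all items. The function names and all parameter values are illustrative assumptions, not calibration results or software from this study.

```python
import math


def rsm_category_probs(theta, delta, taus):
    """Category probabilities under Andrich's rating scale model.

    theta: person measure in logits
    delta: item difficulty in logits
    taus:  thresholds tau_1..tau_M, shared across items
    Returns [P(X=0), ..., P(X=M)].
    """
    # Exponent for category k is the sum over j <= k of (theta - delta - tau_j);
    # the empty sum for category 0 is 0.
    exponents = [0.0]
    for tau in taus:
        exponents.append(exponents[-1] + (theta - delta - tau))
    # Subtract the largest exponent before exponentiating for numerical stability.
    peak = max(exponents)
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def expected_score(theta, delta, taus):
    """Expected item score for a person at theta on an item of difficulty delta."""
    probs = rsm_category_probs(theta, delta, taus)
    return sum(k * p for k, p in enumerate(probs))


if __name__ == "__main__":
    taus = [-1.0, 0.0, 1.0]  # hypothetical thresholds (4 response categories)
    theta = 0.0              # hypothetical person measure
    for delta in (-1.5, 0.0, 1.5):  # hypothetical easy, medium, hard items
        print(delta, round(expected_score(theta, delta, taus), 2))
```

For a fixed person measure, the expected score falls as item difficulty rises. This is the sense in which the model accounts for differences in item difficulty, and it is why items far from a respondent's level contribute little to measurement.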
Scale developers can gather potential items in 3 major ways: (1) adapt published items from other physical function scales, (2) write items based on input from an expert panel, and (3) develop items based on patient interviews. We used each of these in developing the item pool for the FLEX-SF.

From the aHouston Veterans Affairs Parkinson’s Disease Research and Educational Center, bMeasurement Excellence and Training Resource Information Center (METRIC): A Veterans Affairs Health Services Research and Development Resource Center, cBaylor College of Medicine, dTexas Woman’s University, eHealth Services Research and Development Center for Quality of Care and Utilization Studies, and gUniversity of Texas School of Medicine, Houston, and fPearson Educational Measurement, Austin.
Reprint requests: Karon F. Cook, PhD, PADRECC, Houston VAMC (127-PD), 2002 Holcombe Blvd, Houston, TX 77030 (E-mail: karonc@bcm.tmc.edu).
Copyright © 2005 by Journal of Shoulder and Elbow Surgery Board of Trustees.
1058-2746/2005/$30.00
doi:10.1016/j.jse.2004.09.024

Existing physical function scales