Journal of Clinical Epidemiology 58 (2005) 902–908

Scoring based on item response theory did not alter the measurement ability of EORTC QLQ-C30 scales

Morten Aa. Petersen a,*, Mogens Groenvold a,b, Neil Aaronson c, Elisabeth Brenne d, Peter Fayers e, Julie Damgaard Nielsen f, Mirjam Sprangers g, Jakob B. Bjorner h,i, for the European Organisation for Research and Treatment of Cancer Quality of Life Group

a The Research Unit, Department of Palliative Medicine, Bispebjerg Hospital, Bispebjerg Bakke 23, 2400 Copenhagen, Denmark
b Institute of Public Health, University of Copenhagen, Oester Farimagsgade 5A, 1399 Copenhagen, Denmark
c Division of Psychosocial Research & Epidemiology, The Netherlands Cancer Institute, Plesmanlaan 121, 1066 CX Amsterdam, The Netherlands
d Palliative Medicine Unit, Department of Oncology and Radiotherapy, Trondheim University Hospital, Olav Kyrres gate 17, 7006 Trondheim, Norway
e Department of Public Health, Aberdeen University Medical School, King's College, Aberdeen, AB24 3FK, UK
f The Research Unit for General Practice, University of Aarhus, Vennelyst Blvd 9, 8000 Aarhus, Denmark
g Department of Medical Psychology, Academic Medical Center, University of Amsterdam, Meibergdreef 15, 1105 AZ Amsterdam, The Netherlands
h National Institute of Occupational Health, Lersoe Parkallé 105, 2100 Copenhagen, Denmark
i QualityMetric Incorporated, 640 George Washington Hwy, Lincoln, RI 02865, USA

Accepted 14 February 2005

Abstract

Background and Objectives: Most health-related quality-of-life questionnaires include multi-item scales. Scale scores are usually estimated as simple sums of the item scores. However, scoring procedures utilizing more information from the items might improve measurement abilities, and thereby reduce the needed sample sizes. We investigated whether item response theory (IRT)-based scoring improved the measurement abilities of the EORTC QLQ-C30 physical functioning, emotional functioning, and fatigue scales.
Methods: Using a database of 13,010 subjects, we estimated the relative validities of IRT scoring compared to sum scoring of the scales.

Results: The mean relative validities were 1.04 (physical), 1.03 (emotional), and 0.97 (fatigue). None of these were significantly larger than 1. Thus, no gain in measurement abilities using IRT scoring was found for these scales. Possible explanations include that the items in the scales are not constructed for IRT scoring and that the scales are relatively short.

Conclusion: IRT scoring of the three longest EORTC QLQ-C30 scales did not improve measurement abilities compared to the traditional sum scoring of the scales.

© 2005 Elsevier Inc. All rights reserved.

Keywords: EORTC QLQ-C30; IRT scoring; Known-groups comparisons; Quality of life; Relative validity; Sum scoring

doi: 10.1016/j.jclinepi.2005.02.008

* Corresponding author. Tel.: (+45) 3531 2025; fax: (+45) 3531 2071. E-mail address: map01@bbh.hosp.dk (M.A. Petersen).

1. Introduction

Self-report questionnaires are widely used for measuring health-related quality of life. The majority of such questionnaires use multi-item scales for measuring the different aspects (domains) of quality of life. Such scale scores are commonly estimated as simple sums (or means) of the item scores. Sum scores are simple to construct and interpret. However, more sophisticated scoring procedures utilizing more of the information in the items might improve the measurement abilities. Improving the measurement abilities means that the number of patients included in a study can be reduced without reducing the power to detect differences between groups or within groups over time. Therefore, it is highly relevant to explore whether new scoring methods can improve the measurement abilities of a scale. In recent years there has been an increasing interest in the use of item response theory (IRT) [1].
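The contrast between sum scoring and IRT-based scoring can be sketched in code. The following is a minimal illustration only, not the scoring used in this study: it assumes four dichotomous items under a two-parameter logistic (2PL) model with hypothetical item parameters, and estimates the latent trait by expected a posteriori (EAP) under a standard-normal prior. It shows how two response patterns with identical sum scores can receive different IRT scores, because the IRT model uses more of the information in the items.

```python
import math

# Hypothetical 2PL item parameters (discrimination a, difficulty b) for four
# dichotomous items; illustration values only, not EORTC QLQ-C30 parameters.
ITEMS = [(2.0, -1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 2.0)]

def sum_score(responses):
    """Traditional scoring: the simple sum of the item scores."""
    return sum(responses)

def eap_score(responses, items=ITEMS):
    """IRT-based scoring sketch: expected a posteriori (EAP) estimate of the
    latent trait under a 2PL model with a standard-normal prior, computed by
    numerical integration over a grid of theta values."""
    grid = [-4.0 + 0.1 * i for i in range(81)]
    num = den = 0.0
    for theta in grid:
        # Likelihood of the observed response pattern at this theta
        like = 1.0
        for x, (a, b) in zip(responses, items):
            p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
            like *= p if x == 1 else 1.0 - p
        weight = like * math.exp(-0.5 * theta * theta)  # unnormalized N(0,1) prior
        num += theta * weight
        den += weight
    return num / den

if __name__ == "__main__":
    # Two response patterns with identical sum scores ...
    pat1 = [1, 1, 0, 0]
    pat2 = [0, 1, 1, 0]
    print(sum_score(pat1), sum_score(pat2))  # same sum score
    # ... but different EAP estimates: the 2PL model weights items by their
    # discrimination, so which items were endorsed matters, not just how many.
    print(eap_score(pat1), eap_score(pat2))
```

Under a pure Rasch model (equal discriminations) the sum score is a sufficient statistic and the two estimates would coincide, which is one reason a gain from IRT scoring is not guaranteed in practice.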
IRT-based scoring methods potentially use much more of the information in the items than simple sum scoring. In "the right settings," that is, when the assumptions underlying the IRT methodology are fulfilled, the use of IRT is the optimal way to score a scale [2]. Furthermore, IRT has several other theoretic advantages that can be utilized when analysing multi-item scales; these include the possibilities of estimating missing responses, constructing individually tailored scales, and using adaptive testing [3,4]. However, IRT scoring is more complex than conventional scoring methods, and may be more difficult to apply. Therefore, before changing the scoring of a scale it should be established whether