Using repeaters for estimating comparable scores

Michelle Liou*, Philip E. Cheng and Chieh-Jung Wu
Institute of Statistical Science, Academia Sinica, Taiwan, ROC

British Journal of Mathematical and Statistical Psychology (1999), 52, 273–284. © 1999 The British Psychological Society.

* Requests for reprints should be addressed to Michelle Liou, Institute of Statistical Science, Academia Sinica, Taipei 115, Taiwan, ROC (e-mail: mliou@stat.sinica.edu.tw).

Multiple forms of a test for placement services and for licensure or certification exams are developed to measure the same competence, and need to be scaled so that observed scores on these forms are comparable. Conventionally, comparability studies have used the common-item design to adjust for selection bias between candidates taking different forms. In practice, the common-item scores may not be immediately available for comparability studies; even when available, they can be contaminated by non-random error due to test disclosure. On the other hand, scores from candidates who take more than one form of the test can be used to scale test forms to achieve comparability. This study treats comparability studies using repeaters as an incomplete-data problem: most candidates have observed scores on one form and missing scores on the other, but repeaters have both. We propose a general model for estimating the score distributions that would have been observed on the test forms if no candidate had missing scores. The score distributions based on all candidates can then be used to find comparable scores by the equipercentile method. The model parameters are estimated by maximizing the incomplete-data likelihood via an EM algorithm. The standard errors of comparable scores are also derived under the proposed model. The use of repeaters for establishing score comparability is investigated in an empirical study.

1. Introduction

Multiple forms of a test designed for placement services and for licensure or certification exams are developed for measuring the same competence and need to be scaled so that observed scores on these forms are comparable. We follow Marco, Abdel-Fattah & Baron (1992) in using the term 'comparable' rather than 'equated' for scores on target tests that are not necessarily equivalent forms. One typical example would be scaling scores on the American College Testing Assessment and the College Board's Scholastic Aptitude Test to achieve comparability. Conventionally, the common-item design has been used to collect data for comparability studies. The design allows selection bias between candidates taking different forms to be estimated from common-item scores, so that the comparable scores so determined are less affected by group differences in ability. The chained equipercentile and frequency estimation methods were developed for estimating comparable scores with data collected under the common-item design (Angoff, 1984).
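As a point of reference for the equipercentile method invoked above and in the abstract, the following is the standard definition stated as a sketch rather than in this paper's later notation: if $F$ and $G$ denote the cumulative distributions of observed scores on the two forms, estimated here from all candidates, the comparable score on the second form for a score $x$ on the first is the score with the same percentile rank,

$$ e_Y(x) = G^{-1}\bigl(F(x)\bigr), $$

where the discrete score distributions are continuized in the usual way before $G$ is inverted.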