Using repeaters for estimating comparable scores
Michelle Liou*, Philip E. Cheng and Chieh-Jung Wu
Institute of Statistical Science, Academia Sinica, Taiwan, ROC
Multiple forms of a test for placement services and for licensure or certification
exams are developed for measuring the same competence and need to be scaled so
that observed scores on these forms are comparable. Conventionally, comparability
studies have used the common-item design to adjust for selection bias between
candidates taking different forms. In practice, the common-item scores may not
be immediately available for comparability studies; even when available, they can
be contaminated by non-random error due to test disclosure. On the other hand,
scores from candidates taking more than one form of the test can be used to
scale the forms to achieve comparability. This study treats comparability
studies using repeaters as an incomplete-data problem: most candidates
have an observed score on one form and a missing score on the other, whereas
repeaters have both. We propose a general model for estimating the score
distributions that would have been observed on each form if no candidates had
missing scores. The score distributions based on all candidates can then be used to find comparable scores by
the equipercentile method. The model parameters are estimated by maximizing the
incomplete-data likelihood via an EM algorithm. The standard errors of comparable
scores are also derived under the proposed model. The use of repeaters for
establishing score comparability is investigated in an empirical study.
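As a concrete illustration of the incomplete-data formulation, suppose the scores (X, Y) on the two forms follow a bivariate normal distribution (a minimal sketch only; the general model proposed in this paper is not restricted to normality, and all function and variable names below are ours). The EM iteration then alternates between imputing the conditional moments of each missing score and updating the maximum-likelihood estimates:

    import numpy as np

    def em_bivariate_normal(x_only, y_only, both, n_iter=200):
        # Illustrative sketch: bivariate normal (X, Y). `x_only` and `y_only`
        # hold scores of single-form candidates; `both` is an (n, 2) array of
        # repeater scores on forms X and Y.
        mu = both.mean(axis=0)                        # initialise from repeaters
        cov = np.cov(both, rowvar=False)
        n = len(x_only) + len(y_only) + len(both)
        for _ in range(n_iter):
            sxx, sxy, syy = cov[0, 0], cov[0, 1], cov[1, 1]
            # E-step: conditional moments of the missing coordinate
            ey = mu[1] + sxy / sxx * (x_only - mu[0])     # E[Y | X = x]
            vy = syy - sxy ** 2 / sxx                     # Var[Y | X]
            ex = mu[0] + sxy / syy * (y_only - mu[1])     # E[X | Y = y]
            vx = sxx - sxy ** 2 / syy                     # Var[X | Y]
            # completed sufficient statistics over all candidates
            sx = x_only.sum() + ex.sum() + both[:, 0].sum()
            sy = ey.sum() + y_only.sum() + both[:, 1].sum()
            sxx_c = (x_only ** 2).sum() + (ex ** 2 + vx).sum() + (both[:, 0] ** 2).sum()
            syy_c = (ey ** 2 + vy).sum() + (y_only ** 2).sum() + (both[:, 1] ** 2).sum()
            sxy_c = (x_only * ey).sum() + (ex * y_only).sum() + (both[:, 0] * both[:, 1]).sum()
            # M-step: maximum-likelihood updates
            mu = np.array([sx, sy]) / n
            cov = np.array([[sxx_c / n - mu[0] ** 2, sxy_c / n - mu[0] * mu[1]],
                            [sxy_c / n - mu[0] * mu[1], syy_c / n - mu[1] ** 2]])
        return mu, cov

The fitted marginal distributions of X and Y, now based on all candidates rather than on repeaters alone, supply the inputs to the equipercentile transformation.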
1. Introduction
Multiple forms of a test designed for placement services and for licensure or certification
exams are developed for measuring the same competence and need to be scaled so that
observed scores on these forms are comparable. We follow Marco, Abdel-Fattah & Baron
(1992) in using the term ‘comparable’ rather than ‘equated’ for scores on target tests that are
not necessarily equivalent forms. One typical example would be scaling scores on the
American College Testing Assessment and the College Board’s Scholastic Aptitude Test to
achieve comparability. Conventionally, the common-item design has been used to collect
data for comparability studies. The design allows selection bias between candidates taking
different forms to be estimated from common-item scores such that the comparable scores so
determined will be less affected by group differences in ability. The chained equipercentile
and frequency estimation methods were developed for estimating comparable scores with
data collected under the common-item design (Angoff, 1984).
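For reference, the basic equipercentile transformation maps a form-X score x to e_Y(x) = G^{-1}(F(x)), where F and G denote the cumulative score distributions on forms X and Y; roughly speaking, the chained and frequency estimation methods differ in how these distributions are estimated from common-item data before the transformation is applied. A minimal sketch using raw empirical distributions (the function name and the absence of the smoothing used in operational equating are our simplifications):

    import numpy as np

    def equipercentile(x, x_scores, y_scores):
        # Form-Y score with the same percentile rank that x has on form X.
        p = np.mean(np.asarray(x_scores) <= x)   # F(x): percentile rank of x
        return np.quantile(y_scores, p)          # G^{-1}(p): matching form-Y score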
* Requests for reprints should be addressed to Michelle Liou, Institute of Statistical Science, Academia Sinica, Taipei
115, Taiwan, ROC (e-mail: mliou@stat.sinica.edu.tw).