Journal of Applied Psychology
1999, Vol. 84, No. 4, 610-619
Copyright 1999 by the American Psychological Association, Inc. 0021-9010/99/$3.00

What Is the Shelf Life of a Test? The Effect of Time on the Psychometrics of a Cognitive Ability Test Battery

Kim-Yin Chan
University of Illinois at Urbana-Champaign and Ministry of Defense, Singapore

Fritz Drasgow
University of Illinois at Urbana-Champaign

Linda L. Sawin
Air Force Research Laboratory

The psychometric stability of the Armed Services Vocational Aptitude Battery was studied with data collected at 5 points over a 16-year period using item response theory (IRT) methods. Although 25 of the 200 items changed significantly over the years across 3 different gender-ethnic groups (i.e., White men, White women, and Black men), the overall characteristics of the tests were not severely affected by item-level changes. Items from tests that were more semantically laden were found to be more susceptible to the effects of time compared with those that focused on skills and principles. The findings are discussed in the context of the effects of time on the effectiveness of psychological measures. A call is made to test developers and test users to pay attention to the shelf life of their tests. The use of IRT methods for studying the effects of time on psychometrics is also discussed.

Kim-Yin Chan, Department of Psychology, University of Illinois at Urbana-Champaign and the Ministry of Defense, Singapore; Fritz Drasgow, Department of Psychology, University of Illinois at Urbana-Champaign; Linda L. Sawin, Air Force Research Laboratory, Brooks Air Force Base, San Antonio, Texas. Linda L. Sawin is now at Aon Consulting, Grosse Pointe, Michigan.

An earlier version of this article was presented at the annual conference of the Society for Industrial and Organizational Psychology, Dallas, Texas, April 1998. The study was conducted during a visit by Kim-Yin Chan to the Air Force Research Laboratory that was funded by the Singapore Ministry of Defense. The views expressed in this article are those of the authors and do not necessarily reflect the views of the Armstrong Laboratory, the U.S. Department of Defense, or the Singapore Ministry of Defense. We thank the U.S. Department of Defense for facilitating the research reported here. We are grateful to Terry Ackerman for his helpful suggestions. We also acknowledge the assistance of Rich Walker and Mary Beccera in the project.

Correspondence concerning this article should be addressed either to Kim-Yin Chan, Applied Behavioral Sciences Department, Manpower Division—Ministry of Defense, Defense Technology Towers, Tower B #16-01, 5 Depot Road, Singapore 109681, Republic of Singapore, or to Fritz Drasgow, Department of Psychology, University of Illinois at Urbana-Champaign, East Daniel Street, Champaign, Illinois 61820. Electronic mail may be sent to Kim-Yin Chan at kychan@starnet.gov.sg or to Fritz Drasgow at fdrasgow@uiuc.edu.

The impact of time on the effectiveness of psychological measures has generally received little attention. Although test developers usually try to examine the temporal stability of tests, such test-retest reliability studies are usually conducted over a short period of time, with the hope that the meaning of test scores does not change significantly with time. Others, such as Alvares and Hulin (1972) and Henry and Hulin (1987), have looked at the effects of time on the predictive validity of ability tests. These studies have essentially focused on the issue of the relational equivalence (Drasgow, 1984) of psychological measures over time without addressing the issue of the measurement equivalence of the instruments with time—that is, how do the psychometric characteristics of our measures change with time?

In the field of applied psychology, much attention has been given to the effects of gender and ethnicity on the psychometric characteristics of psychological measures.
Sophisticated methods using item response theory (IRT) have been developed in recent years to examine differential item functioning (DIF; see Holland & Wainer, 1993) and differential test functioning (DTF; Raju, van der Linden, & Fleer, 1995) across groups, motivated by concerns that psychological tests must not be biased when used in employment decisions. Although it is important to ensure that tests demonstrate measurement equivalence across groups, it is also vital that they do not change in their effectiveness over time for the various groups. We propose using the methods developed to study DIF and DTF over time. This is consistent with Angoff's (1988) call for the application of DIF methodology to a wide variety of important educational and psychological contexts, including culture, time, geography,
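The core idea of applying DIF methods over time can be sketched numerically: calibrate an item's characteristic curve separately in two cohorts (e.g., a 1980 reference sample and a later focal sample) and quantify how far the two curves diverge. The following is a minimal sketch, not the authors' procedure; it assumes a three-parameter logistic (3PL) model with hypothetical item parameters and uses a simple unsigned-area index between the curves:

```python
import numpy as np

def icc_3pl(theta, a, b, c):
    """3PL item characteristic curve: P(correct | theta)."""
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

# Hypothetical calibrations of one item at two time points
ref = dict(a=1.2, b=0.0, c=0.2)    # e.g., earlier cohort
focal = dict(a=1.2, b=0.5, c=0.2)  # e.g., later cohort: item grew harder

# Unsigned area between the two ICCs, approximated on a theta grid
theta = np.linspace(-4.0, 4.0, 801)
gap = np.abs(icc_3pl(theta, **ref) - icc_3pl(theta, **focal))
area = float(np.sum(gap) * (theta[1] - theta[0]))
print(round(area, 3))  # near (1 - c) * |b_ref - b_focal| = 0.4
```

With equal a and c parameters, the unsigned area reduces to the closed form (1 - c)|b1 - b2|, so the numerical estimate serves mainly as a check; an item whose area exceeds some cutoff would be flagged as functioning differently across time points, by analogy with DIF across demographic groups.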