Personality and Individual Differences 48 (2010) 294–298. doi:10.1016/j.paid.2009.10.019

Examining the Flynn Effect in the General Social Survey Vocabulary test using item response theory

A. Alexander Beaujean a,*, Yanyan Sheng b

a Baylor Psychometric Laboratory, Baylor University, Department of Educational Psychology, One Bear Place #97301, Waco, TX 76798-7301, USA
b Southern Illinois University, Department of Educational Psychology & Special Education, Carbondale, IL 62901, USA

* Corresponding author. Tel.: +1 254 710 1548; fax: +1 254 710 3265. E-mail address: Alex_Beaujean@Baylor.edu (A. A. Beaujean).

Article history: Received 11 March 2009; received in revised form 5 September 2009; accepted 12 October 2009; available online 11 November 2009.

Keywords: Flynn Effect; Item response theory; General Social Survey

Abstract

Most studies of the Flynn Effect (FE) use classical test theory (CTT)-derived scores, such as summed raw scores. In doing so, they cannot test competing hypotheses about the FE, such as whether it is caused by a real change in cognitive ability or by a change in the tests that measure cognitive ability. An alternative to CTT-derived scores is to use latent variable scores, such as those from item response theory (IRT). This study examined the FE on the Vocabulary test in the General Social Survey using IRT. The results indicate that while there has been a decrease–increase trend since the 1970s, the IRT-based scores never differed from the 1970s comparison point by more than would be expected from random fluctuation. In contrast, while the CTT-derived summed scores showed the same decrease–increase pattern, all comparisons between the time points and the 1980s group fell outside a 95% confidence interval. Multiple reasons for these results are discussed, with the conclusion that more multiple-time-point studies of the FE using IRT are needed.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

The Flynn Effect (FE), the rise in IQ scores over the 20th century (Flynn, 1984, 1987), has been an active area of inquiry over the past three decades (Daley, Whaley, Sigman, Espinosa, & Neumann, 2003; Kanaya, Scullin, & Ceci, 2003; Sanborn, Truscott, Phelps, & McDougal, 2003; Sundet, Barlaug, & Torjussen, 2004). Those who think the FE represents a real change in cognitive ability have made multiple attempts to explain the rise, ranging from nutritional changes (Lynn, 2009), to curricular changes (Blair, Gamson, Thorne, & Baker, 2005), to heterosis (outbreeding; Mingroni, 2004). Others, however, argue that the FE does not represent a real change in cognitive ability; instead, it is the result of various psychometric artifacts (i.e., the tests' properties change over time, not the respondents'; Brand, 1996; Wicherts et al., 2004). In actuality, the FE is likely a combination of multiple factors working concurrently (Jensen, 1998).

One common thread in most FE research is the reliance on scores derived from classical test theory (CTT) (for exceptions, see Beaujean & Osterlind, 2008; Flieller, 1988; Wicherts et al., 2004). CTT is concerned with the estimation of a "true score," and the resulting statistical analysis uses a function of the summed raw scores to estimate this true score (Crocker & Algina, 1986). Analyzing CTT-derived scores to study the FE is unfortunate for multiple reasons (Borsboom, 2005), the most cogent being that such scores cannot differentiate between two very distinct and important hypotheses (Chan, 1998): that the FE results from an increase in cognitive ability, versus that it results from changes in the cognitive ability tests over time.
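For reference, the classical test theory model decomposes an observed score as

X = T + E,

where X is the observed (summed) score, T is the examinee's true score, and E is random measurement error (Crocker & Algina, 1986). Because T is defined only with respect to a particular test form, analyses of X alone cannot separate change in the underlying trait from change in the items that constitute the test.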
In contrast to analyzing CTT-derived scores, latent variable analysis allows the investigator to differentiate between the manifest test scores and the trait(s) they are designed to measure. When the variables under investigation are individual test items (instead of summed scores), the latent variable model is called an item response theory (IRT) model. An IRT model specifies how an individual's (latent) trait level relates to the response to a specific test item, as well as to the set of items in which that item resides (Baker & Kim, 2004). Whereas CTT focuses on examinees' total test score, IRT focuses on both the individual items and the examinees' (latent) trait score. This crucial difference yields two very useful properties when examining the FE. First, IRT methods allow for non-equivalent groups equating (Zimowski, 2003): even though groups may differ significantly on the trait a test is measuring, an IRT model allows the groups' scores to be equated onto the same scale. Second, in IRT models the item parameters do not depend on the ability of the examinees responding to the items, and the examinees' scores do not depend on the specific test items. Thus, groups can differ widely on the trait a test is measuring, but the item parameters should be the same (within a linear transformation). So, if two groups of examinees take the same test at different time points and there is a significant change in the item parameters, this indicates that the items themselves have changed, rather than (or in addition to) the examinees' standing on the trait.
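To fix notation, consider the two-parameter logistic (2PL) model, one common IRT model for dichotomously scored items (used here purely for illustration):

P(X_{ij} = 1 | \theta_i) = \frac{1}{1 + \exp[-a_j(\theta_i - b_j)]},

where X_{ij} is examinee i's response to item j, \theta_i is the examinee's latent trait level, a_j is the item's discrimination, and b_j is its difficulty. Because a_j and b_j are properties of the items rather than of the examinee sample, they should remain invariant (up to a linear transformation of the \theta scale) across cohorts that differ in ability; a significant shift in these parameters across time points therefore implicates the items rather than the trait.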