Toward an Understanding of Situational Judgment Item Validity and Group Differences

Michael A. McDaniel, Virginia Commonwealth University
Joseph Psotka and Peter J. Legree, U.S. Army Research Institute for the Behavioral and Social Sciences
Amy Powell Yost, Capital One
Jeff A. Weekley, Kenexa

This paper evaluates 2 adjustments to common scoring approaches for situational judgment tests (SJTs). These adjustments can result in substantial improvements to item validity, reductions in mean racial differences, and resistance to coaching designed to improve scores. The first adjustment, applicable to SJTs that use Likert scales, controls for elevation and scatter (Cronbach & Gleser, 1953). This adjustment improves item validity. Also, because there is a White–Black mean difference in the preference for extreme responses on Likert scales (Bachman & O'Malley, 1984), these adjustments substantially reduce White–Black mean score differences. Furthermore, this adjustment often eliminates the score elevation associated with the coaching strategy of avoiding extreme responses (Cullen, Sackett, & Lievens, 2006). Item validity is shown to have a U-shaped relationship with item means. This holds both for SJTs with Likert score response formats and for SJTs where respondents identify the best and worst response option. Given the U-shaped relationship, the second adjustment is to drop items with midrange item means. This permits the SJT to be shortened, sometimes dramatically, without necessarily harming validity.

Keywords: situational judgment test, extreme responding, racial differences, validity

Situational judgment tests (SJTs) present job applicants with written or video-based problem scenarios and a set of possible response options. Job applicants evaluate the effectiveness of the responses for addressing the problem described in the scenario.
Although SJTs have been used in personnel selection for about eighty years (McDaniel, Morgeson, Finnegan, Campion, & Braverman, 2001; Moss, 1926), there is very little research addressing how best to build and score SJTs (Schmitt & Chan, 2006; Weekley, Ployhart, & Holtz, 2006). In the absence of this knowledge, many approaches have evolved for developing and scoring SJTs (Bergman, Drasgow, Donovan, Henning, & Juraska, 2006; Weekley et al., 2006), but the effectiveness of these methods for maximizing criterion-related validity is largely unknown.

Unlike those in cognitive ability or job knowledge tests, response options in SJTs cannot easily be declared correct or incorrect. As such, items are typically scored with some form of consensus judgment (Legree, Psotka, Tremble, & Bourne, 2005). Typically, expert judges are asked to reach consensus concerning which responses are preferred (Weekley & Ployhart, 2006). Consensus may also be based on the responses of applicants, incumbents, supervisors of incumbents, or even customers. In such applications, the means of the respondents are considered the correct response (i.e., the test answer key).

Consensual scoring is a form of profile matching. One profile consists of the means of the items collected from some group (e.g., experts). The other profile is the item responses of a job applicant. A respondent's score on an SJT using a Likert format response scale is a function of the degree of match between the respondent's answers and the group means. Cronbach and Gleser (1953) conceptualized profile matching with respect to elevation, scatter, and shape. Elevation is the mean of the items for a respondent. Scatter reflects the magnitude of a respondent's score deviations from the respondent's own mean. Legree (1995; Legree et al., 2005) suggested controlling for elevation and scatter.
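The consensus-scoring logic described above can be sketched in a few lines. This is a minimal illustration, not the authors' scoring procedure; the expert ratings and respondent profile are hypothetical, and negative mean absolute deviation is used as one plausible distance-based match score:

```python
import numpy as np

# Hypothetical Likert (1-7) effectiveness ratings for 5 SJT response
# options, one row per expert judge.
expert_ratings = np.array([
    [6, 2, 5, 3, 7],
    [7, 1, 5, 2, 6],
    [6, 2, 4, 3, 7],
])

# The consensus key is the item-wise mean of the expert profiles.
key = expert_ratings.mean(axis=0)

# A job applicant's ratings of the same 5 response options.
respondent = np.array([5, 3, 5, 2, 7])

# One distance-based match score: negative mean absolute deviation
# from the key, so closer profile matches yield higher scores.
score = -np.mean(np.abs(respondent - key))
```

Other profile-match metrics (e.g., squared deviations or correlations) follow the same pattern of comparing a respondent's profile against the group-mean key.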
If one standardizes scores using a within-person z transformation, all respondents would have the same mean (0) and the same standard deviation (1) across items. This transformation removes information related to elevation and scatter from the scores, because all respondents have identical elevation and scatter. The remaining score information in the within-person standardized scores is called shape. Cronbach and Gleser argued that investigators should consider whether elevation and scatter are important in their profile-matching applications.

This article was published Online First January 24, 2011.
Michael A. McDaniel, School of Business, Virginia Commonwealth University; Joseph Psotka and Peter J. Legree, U.S. Army Research Institute for the Behavioral and Social Sciences, Arlington, Virginia; Amy Powell Yost, Capital One, Tampa, Florida; Jeff A. Weekley, Kenexa, Frisco, Texas.
This research was funded by the U.S. Army Research Institute for the Behavioral and Social Sciences through Contract W91WAW-07-C-0013, awarded to Work Skills First, Inc. The views, opinions, and/or findings contained in this article are solely those of the authors and should not be construed as an official Department of the Army or Department of Defense position, policy, or decision, unless so designated by other documentation.
Correspondence concerning this article should be addressed to Michael A. McDaniel, School of Business, Virginia Commonwealth University, 301 West Main Street, P.O. Box 844000, Richmond, VA 23284-4000. E-mail: mamcdani@vcu.edu
Journal of Applied Psychology, 2011, Vol. 96, No. 2, 327–336. © 2011 American Psychological Association. 0021-9010/11/$12.00 DOI: 10.1037/a0021983
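The within-person z transformation described in the text can be sketched as follows. This is an illustrative sketch with hypothetical ratings, not code from the article; it shows that two respondents whose profiles differ only in elevation and scatter become identical after the transformation, leaving only shape:

```python
import numpy as np

def within_person_z(responses):
    """Within-person z transformation: subtracts the respondent's own
    mean (elevation) and divides by the respondent's own standard
    deviation (scatter), leaving only the profile's shape."""
    r = np.asarray(responses, dtype=float)
    return (r - r.mean()) / r.std()

# Hypothetical Likert ratings for 5 items. resp_b is a linear
# transform of resp_a: same shape, but higher elevation and
# greater scatter (more extreme responding).
resp_a = np.array([6, 2, 5, 3, 7])
resp_b = 2 * resp_a - 4          # [8, 0, 6, 2, 10]

z_a = within_person_z(resp_a)
z_b = within_person_z(resp_b)
# z_a and z_b are identical: both now have mean 0 and SD 1.
```

Scoring on shape alone could then proceed by, for example, correlating a respondent's standardized profile with the consensus key profile.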