Toward an Understanding of Situational Judgment Item Validity and
Group Differences
Michael A. McDaniel
Virginia Commonwealth University
Joseph Psotka and Peter J. Legree
U.S. Army Research Institute for the Behavioral and Social
Sciences
Amy Powell Yost
Capital One
Jeff A. Weekley
Kenexa
This paper evaluates 2 adjustments to common scoring approaches for situational judgment tests (SJTs). These
adjustments can result in substantial improvements to item validity, reductions in mean racial differences, and
resistance to coaching designed to improve scores. The first adjustment, applicable to SJTs that use Likert
scales, controls for elevation and scatter (Cronbach & Gleser, 1953). This adjustment improves item validity.
Also, because there is a White–Black mean difference in the preference for extreme responses on Likert scales
(Bachman & O’Malley, 1984), these adjustments substantially reduce White–Black mean score differences.
Furthermore, this adjustment often eliminates the score elevation associated with the coaching strategy of
avoiding extreme responses (Cullen, Sackett, & Lievens, 2006). Item validity is shown to have a U-shaped
relationship with item means. This holds both for SJTs with Likert-scale response formats and for SJTs in which
respondents identify the best and worst response options. Given the U-shaped relationship, the second
adjustment is to drop items with midrange item means. This permits the SJT to be shortened, sometimes
dramatically, without necessarily harming validity.
Keywords: situational judgment test, extreme responding, racial differences, validity
Situational judgment tests (SJTs) present job applicants with
written or video-based problem scenarios and a set of possible
response options. Job applicants evaluate the effectiveness of the
responses for addressing the problem described in the scenario.
Although SJTs have been used in personnel selection for about
80 years (McDaniel, Morgeson, Finnegan, Campion, & Braverman,
2001; Moss, 1926), there is very little research addressing
how to best build and score SJTs (Schmitt & Chan, 2006;
Weekley, Ployhart, & Holtz, 2006). In the absence of this knowledge,
many approaches have evolved for developing and scoring SJTs
(Bergman, Drasgow, Donovan, Henning, & Juraska, 2006;
Weekley et al., 2006), but the effectiveness of these methods for
maximizing criterion-related validity is largely unknown.
Unlike those in cognitive ability or job knowledge tests,
response options in SJTs cannot easily be declared correct or
incorrect. As such, items are typically scored with some form of
consensus judgment (Legree, Psotka, Tremble, & Bourne, 2005).
Typically, expert judges are asked to reach consensus concerning
which responses are preferred (Weekley & Ployhart, 2006).
Consensus may also be based on the responses of applicants,
incumbents, supervisors of incumbents, or even customers. In such
applications, the means of the respondents are considered the
correct response (i.e., the test answer key).
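As a minimal sketch of this consensual scoring logic, the key can be computed as the per-item mean of some reference group's ratings, and an applicant can be scored by closeness to that key. The data and the distance-based scoring rule below are hypothetical illustrations, not the specific procedure used in this paper.

```python
import numpy as np

# Hypothetical expert ratings: rows are judges, columns are SJT
# response options rated on a 1-7 Likert effectiveness scale.
expert_ratings = np.array([
    [6, 2, 5, 1],   # judge 1
    [7, 3, 4, 2],   # judge 2
    [6, 1, 5, 1],   # judge 3
], dtype=float)

# The scoring key is the mean rating of each item across the group.
key = expert_ratings.mean(axis=0)

def consensus_score(applicant, key):
    """Score an applicant's ratings by closeness to the consensus key.

    Uses negative mean absolute deviation, so a perfect match to the
    key scores 0 and larger mismatches score more negatively.
    """
    return -np.abs(np.asarray(applicant, dtype=float) - key).mean()
```

An applicant whose ratings exactly reproduce the group means would receive the maximum score of 0 under this rule; any deviation lowers the score.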
Consensual scoring is a form of profile matching. One profile
consists of the means of the items collected from some group (e.g.,
experts). The other profile is the item responses of a job applicant.
A respondent’s score on an SJT using a Likert-format response
scale is a function of the degree of match between the
respondent’s answers and the group means. Cronbach and Gleser
(1953) conceptualized profile matching with respect to
elevation, scatter, and shape. Elevation is the mean of the items for
a respondent. Scatter reflects the magnitude of a respondent’s
score deviations from the respondent’s own mean. Legree
(1995; Legree et al., 2005) suggested controlling for elevation
and scatter. If one standardizes scores using a within-person z
transformation, all respondents would have the same mean (0)
and the same standard deviation (1) across items. This
transformation removes information related to elevation and scatter
from the scores, because all respondents have identical
elevation and scatter. The remaining score information in the
within-person standardized scores is called shape. Cronbach and
Gleser argued that investigators should consider whether
elevation and scatter are important in their profile-matching approach.
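The within-person z transformation described above can be sketched as follows. Each respondent's ratings are standardized against that respondent's own mean (elevation) and standard deviation (scatter), leaving only profile shape. The ratings below are hypothetical; the second respondent's profile is a linear rescaling of the first's, so the two become identical after the transformation.

```python
import numpy as np

# Hypothetical Likert ratings: rows are respondents, columns are items.
ratings = np.array([
    [7, 1, 5, 3],   # an extreme responder
    [5, 2, 4, 3],   # a milder responder with the same profile shape
], dtype=float)

def within_person_z(x):
    """Standardize each row against its own mean and SD.

    Removes elevation (the row mean) and scatter (the row SD),
    leaving only the shape of each respondent's profile.
    """
    elevation = x.mean(axis=1, keepdims=True)
    scatter = x.std(axis=1, keepdims=True)
    return (x - elevation) / scatter

z = within_person_z(ratings)
# Every row of z now has mean 0 and SD 1, so the extreme and mild
# responders above yield identical standardized profiles.
```

This illustrates why the adjustment neutralizes an extreme-responding style: only the ordering and relative spacing of a respondent's ratings survive the transformation.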
This article was published Online First January 24, 2011.
Michael A. McDaniel, School of Business, Virginia Commonwealth
University; Joseph Psotka and Peter J. Legree, U.S. Army Research Insti-
tute for the Behavioral and Social Sciences, Arlington, Virginia; Amy
Powell Yost, Capital One, Tampa, Florida; Jeff A. Weekley, Kenexa,
Frisco, Texas.
This research was funded by the U.S. Army Research Institute for the
Behavioral and Social Sciences through Contract W91WAW-07-C-0013,
awarded to Work Skills First, Inc. The views, opinions, and/or findings
contained in this article are solely those of the authors and should not
be construed as an official Department of the Army or Department of
Defense position, policy, or decision, unless so designated by other
documentation.
Correspondence concerning this article should be addressed to Michael
A. McDaniel, School of Business, Virginia Commonwealth University,
301 West Main Street, P.O. Box 844000, Richmond, VA 23284-4000.
E-mail: mamcdani@vcu.edu
Journal of Applied Psychology © 2011 American Psychological Association
2011, Vol. 96, No. 2, 327–336 0021-9010/11/$12.00 DOI: 10.1037/a0021983