Behaviour & Information Technology, vol. 33, no. 2 (2014), pp. 143-161. Preprint version 1.

What You Get Is What You See: Revisiting the Evaluator Effect in Usability Tests

Morten Hertzum, Computer Science, Roskilde University, Roskilde, Denmark, mhz@ruc.dk
Rolf Molich, DialogDesign, Stenløse, Denmark, molich@dialogdesign.dk
Niels Ebbe Jacobsen, Danish Consumer Council, Copenhagen, Denmark, niels.ebbe.jacobsen@gmail.com

Abstract. Usability evaluation is essential to user-centred design, yet evaluators who analyse the same usability test sessions have been found to identify substantially different sets of usability problems. We revisit this evaluator effect by having 19 experienced usability professionals analyse video-recorded test sessions with five users. Nine participants analysed moderated sessions; ten participants analysed unmoderated sessions. For the moderated sessions, participants individually reported an average of 33% of the full set of problems collectively reported by these nine participants, and 50% of the subset of problems rated critical or serious by at least one participant. For the unmoderated sessions, the corresponding percentages were 32% and 40%. Thus, the evaluator effect was similar for moderated and unmoderated sessions; it was substantial for the full set of problems and still present for the most severe problems. In addition, participants disagreed in their severity ratings: as much as 24% (moderated) and 30% (unmoderated) of the problems reported by multiple participants were rated critical by one participant and minor by another. The majority of the participants perceived an evaluator effect when merging their individual findings into group evaluations. We discuss reasons for the evaluator effect and recommend ways of managing it.

Keywords: usability evaluation, usability test, thinking-aloud test, evaluator effect, problem detection, severity rating

1 INTRODUCTION

Evaluation is essential to the design of usable systems. This was recognised early by, for example, Lewis (1982) and has recently been reiterated by Siegel and Dray (2011). To conduct evaluations, usability professionals need reliable and robust usability evaluation methods. A number of methods have been developed, including cognitive walkthrough (Wharton, Rieman, Lewis, & Polson, 1994), constructive interaction (O'Malley, Draper, & Riley, 1984), heuristic evaluation (Nielsen & Molich, 1990), metaphors of human thinking (Frøkjær & Hornbæk, 2008), and usability tests (Dumas & Redish, 1999). The usability test has long held a prominent position among these methods: some consider it the single most important usability evaluation method (Gulliksen, Boivie, Persson, Hektor, & Herulf, 2004; Nielsen, 1993), and it has been used as a yardstick for other usability evaluation methods (Bailey, Allan, & Raiello, 1992; John & Marks, 1997). This prominent position warrants careful scrutiny of the usability test to understand its strengths and to learn to stay within, or compensate for, its limitations.
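To make the detection percentages in the abstract concrete, the following sketch (ours, not part of the original paper; the evaluator names and problem sets are hypothetical stand-ins for the study's data) computes each evaluator's detection rate against the union of all reported problems and averages the rates. A figure such as the 33% for moderated sessions corresponds to this kind of average, assuming the collective problem set is the union of the individual reports.

    # Illustrative sketch, not from the paper: evaluator names and
    # problem sets are hypothetical.
    reports = {
        "evaluator_A": {"p1", "p2", "p3"},
        "evaluator_B": {"p2", "p4"},
        "evaluator_C": {"p1", "p2", "p5", "p6"},
    }

    # The collective problem set is taken as the union of individual reports.
    all_problems = set().union(*reports.values())

    # Each evaluator's detection rate: the share of the collective set
    # that this evaluator reported.
    rates = {name: len(found) / len(all_problems)
             for name, found in reports.items()}

    print(f"Collective problem set: {len(all_problems)} problems")
    for name, rate in sorted(rates.items()):
        print(f"  {name}: {rate:.0%}")
    print(f"Average detection rate: {sum(rates.values()) / len(rates):.0%}")

With these hypothetical sets, the average detection rate is 50%; in the study, the corresponding averages were 33% (moderated) and 32% (unmoderated), indicating that each professional individually uncovered only about a third of what the group found collectively.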