Generalisability: a key to unlock professional assessment

Jim Crossley,1 Helena Davies,2 Gerry Humphris3 & Brian Jolly4

Context  Reliability is defined as the extent to which a result reflects all possible measurements of the same construct. It is an essential measurement characteristic. Unfortunately, there are few objective tests for the most important aspects of the professional role because they are complex and intangible. In addition, professional performance varies markedly from setting to setting and case to case. Both these factors threaten reliability.

Aim  This paper describes the classical approach to evaluating reliability and points out its limitations. It goes on to describe how generalisability theory overcomes many of these limitations.

Conclusions  A G-study uses variance component analysis to measure the contributions that all relevant factors make to the result (observer, situation, case, assessee and their interactions). This information can be combined to reflect the reliability of a single observation as a reflection of all possible measurements – a true reflection of reliability. It can also be used to estimate the reliability of a combined sample of several different observations, or to predict how many observations are required with different test formats to achieve a given level of reliability. Worked examples are used to illustrate the concepts.

Keywords  educational measurement *standards; education, medical, undergraduate *standards; professional competence *standards; observer variation; reproducibility of results.

Medical Education 2002;36:972–978

Introduction

This paper is the second in a series on professional assessment. The first article highlighted the importance of well-designed assessment in regulation and training, and laid out the general principles of assessment methodology. Reliability was identified as an important measurement characteristic.
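The definition above – that a reliable result "reflects all possible measurements of the same construct" – can be made concrete with a toy simulation. The following sketch is purely illustrative and uses invented numbers, not data from this paper: each trainee has a stable underlying level, every single observation adds case-specific error, and reliability then appears as the agreement (correlation) between two independent observations of the same trainees.

```python
import random

random.seed(1)

# Hypothetical example: 200 trainees, each with a stable underlying level.
n = 200
true_level = [random.gauss(0, 1.0) for _ in range(n)]

def observe(noise_sd):
    """One observation per trainee: true level plus case-specific error."""
    return [t + random.gauss(0, noise_sd) for t in true_level]

def pearson(x, y):
    """Pearson correlation between two score lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Strong case-specificity (large per-observation error): two single
# observations of the same trainees agree poorly ...
r_noisy = pearson(observe(1.5), observe(1.5))

# ... whereas with little case-specificity they agree well.
r_stable = pearson(observe(0.3), observe(0.3))
```

Note what the single correlation cannot do: it quantifies overall agreement, but it cannot say how much of the disagreement came from cases, observers or occasions. That is precisely the gap generalisability theory fills.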
Regulatory decisions are increasingly challenged because the tools on which they are based are not reliable.1 The reliability of a given measurement is the extent to which it reflects all possible measurements of the same construct. Achieving reliability is a particular challenge in professional assessment for two reasons:

1  The professional role is made up of complex behaviours. Apparently 'objective' methods such as knowledge tests do not reflect the richness of these behaviours; they lack authenticity. Equally, it is difficult to reduce these behaviours to a checklist of observable processes. Attempts to measure them directly depend in part on subjective judgements about performance, and it takes substantial effort to ensure the reliability of such judgements.

2  Professional behaviour is highly dependent upon the nature and details of the problem being faced. The same attribute may be demonstrated more or less clearly in different settings or different test observations (e.g. different cases). This is known as case-specificity.2

In this paper we examine the classical approach to quantifying and controlling reliability, and show how generalisability theory offers an important extension to this approach. The basic principles of the theory are described here, but a newcomer wishing to apply the technique should refer to the cited texts or seek advice from a colleague with prior experience. The complexity of the theory is widely acknowledged, and there is limited experience amongst statisticians.3
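The G-study described in the abstract can be sketched in code. This is a minimal illustration under assumed conditions – invented variance components and a fully crossed trainee × case design – and not the authors' own analysis: variance components are estimated from the mean squares, combined into a generalisability coefficient for a single observation or for a mean of several, and then used, decision-study style, to project how many cases a target reliability would require.

```python
import math
import random

random.seed(0)

# Invented variance components for illustration (not from the paper):
# trainee ("assessee") effects, case effects and residual error.
n_p, n_c = 50, 8                      # 50 trainees each scored on 8 cases
sd_person, sd_case, sd_resid = 1.0, 0.5, 1.2

person_eff = [random.gauss(0, sd_person) for _ in range(n_p)]
case_eff = [random.gauss(0, sd_case) for _ in range(n_c)]
scores = [[person_eff[p] + case_eff[c] + random.gauss(0, sd_resid)
           for c in range(n_c)] for p in range(n_p)]

# Estimate variance components from the mean squares of a crossed
# person x case design (the core of a G-study).
grand = sum(sum(row) for row in scores) / (n_p * n_c)
p_means = [sum(row) / n_c for row in scores]
c_means = [sum(scores[p][c] for p in range(n_p)) / n_p for c in range(n_c)]

ss_p = n_c * sum((m - grand) ** 2 for m in p_means)
ss_c = n_p * sum((m - grand) ** 2 for m in c_means)
ss_tot = sum((s - grand) ** 2 for row in scores for s in row)
ss_res = ss_tot - ss_p - ss_c

ms_p = ss_p / (n_p - 1)
ms_c = ss_c / (n_c - 1)
ms_res = ss_res / ((n_p - 1) * (n_c - 1))

var_res = ms_res                           # residual (incl. trainee x case)
var_p = max((ms_p - ms_res) / n_c, 0.0)    # universe-score (trainee) variance
var_c = max((ms_c - ms_res) / n_p, 0.0)    # case variance (reported, not used
                                           # in the relative coefficient below)

def g_coefficient(n_cases):
    """Relative G coefficient for the mean of n_cases observations."""
    return var_p / (var_p + var_res / n_cases)

def cases_needed(target):
    """D-study projection: cases required to reach a target coefficient."""
    return math.ceil(target / (1 - target) * var_res / var_p)
```

Here `g_coefficient(1)` gives the reliability of a single case, `g_coefficient(8)` that of the eight-case mean, and `cases_needed(0.8)` projects the sampling a decision (D-) study would recommend. In a real G-study the components would usually be estimated with dedicated software rather than by hand.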
1 Department of Paediatric Medicine, Sheffield Children's Hospital, Sheffield, UK
2 Medical Education, Sheffield Children's Hospital, Sheffield, UK
3 University Department of Psychiatry, Manchester Royal Infirmary, Manchester, UK
4 Centre for Medical and Health Sciences Education, Faculty of Medicine, Monash University, Victoria, Australia

Correspondence: Jim Crossley, Department of Paediatric Medicine, Sheffield Children's Hospital, Western Bank, Sheffield S10 2TH, UK. E-mail: j.crossley@sheffield.ac.uk